Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/10/21 08:15:18 UTC

[GitHub] [iceberg] Fokko commented on pull request #6010: Python: Fix caching of the PyArrowFileIO

Fokko commented on PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#issuecomment-1286621641

   Sorry for the limited context. I'm working on converting the files into a PyArrow Dataset, which requires passing in a single filesystem and a list of files. The file paths can't include a scheme, because PyArrow throws an error if they do. The idea behind it is that the S3FileSystem already indicates that these are S3 paths.
   
   By splitting this, we can reuse the logic when passing the list of files to the Dataset. Without the split, the code looks like this:
   
   ```python
   io = self.table.io()
   if isinstance(io, FsspecFileIO):
       ...
   elif isinstance(io, PyArrowFileIO):
       # We should not use internal methods
       fs = io._get_fs_and_path(files[0])[0]
       # This is also awkward, PyArrow requires removing the s3a://
       files = ["".join(urlparse(file)[1:3]) for file in files]
   else:
       raise ValueError(f"Unsupported FileSystem: {io}")
   ```
   
   Convert it into:
   ```python
   io = self.table.io()
   if isinstance(io, FsspecFileIO):
       ...
   elif isinstance(io, PyArrowFileIO):
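       # Materialize the results so the first entry can be inspected without consuming an iterator
       normalized_files = [PyArrowFileIO.normalize_location(file) for file in files]
       fs = io.get_fs(normalized_files[0].scheme)
       files = [file.path for file in normalized_files]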
   else:
       raise ValueError(f"Unsupported FileSystem: {io}")
   ```
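
As a rough illustration of the scheme-stripping step described above, here is a minimal standard-library sketch. The helper name `split_locations` is hypothetical and not part of PyIceberg or PyArrow; it assumes every file shares one scheme and that the scheme-less form PyArrow expects is the bucket followed by the key, matching the `urlparse(file)[1:3]` slice in the first snippet.

```python
from typing import List, Tuple
from urllib.parse import urlparse


def split_locations(files: List[str]) -> Tuple[str, List[str]]:
    """Hypothetical helper: return the shared scheme and the scheme-less paths."""
    parsed = [urlparse(file) for file in files]
    schemes = {p.scheme for p in parsed}
    # A single filesystem is passed to the Dataset, so all files must share one scheme
    if len(schemes) != 1:
        raise ValueError(f"Expected exactly one scheme, got: {sorted(schemes)}")
    # netloc is the bucket and path is the key, e.g. "bucket" + "/a.parquet"
    paths = [f"{p.netloc}{p.path}" for p in parsed]
    return schemes.pop(), paths


scheme, paths = split_locations(
    ["s3a://bucket/a.parquet", "s3a://bucket/dir/b.parquet"]
)
```

The returned scheme could then be used to look up the filesystem, and the paths handed to the Dataset alongside it.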


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

