You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/04 05:54:45 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6010: Python: Fix PyArrowFileIO caching

Fokko opened a new pull request, #6010:
URL: https://github.com/apache/iceberg/pull/6010

   The path was part of the lru_cache call, meaning that for each different file it would miss the cache. Also refactored the path/location structure since PyArrow requires URLs without a scheme.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1013633552


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -223,8 +236,8 @@ def new_input(self, location: str) -> PyArrowFile:
         Returns:
             PyArrowFile: A PyArrowFile instance for the given location
         """
-        fs, path = self.get_fs_and_path(location)
-        return PyArrowFile(fs=fs, location=location, path=path)
+        fs = self.get_fs(self.extract_scheme(location))
+        return PyArrowFile(fs=fs, location=location)

Review Comment:
   My main argument here is, that if you want to use the PyArrowFile on its own, then the user is responsible for setting the location and path, which I think should be kept internal of the object.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6010: Python: Fix caching of the PyArrowFileIO

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1002156653


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -196,8 +214,9 @@ def new_input(self, location: str) -> PyArrowFile:
         Returns:
             PyArrowFile: A PyArrowFile instance for the given location
         """
-        fs, path = self.get_fs_and_path(location)
-        return PyArrowFile(fs=fs, location=location, path=path)
+        santized_location = self.normalize_location(location)
+        fs = self.get_fs(santized_location.scheme)
+        return PyArrowFile(fs=fs, location=santized_location)

Review Comment:
   Normalized rather than `sanitized`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1013641436


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -93,11 +88,23 @@ class PyArrowFile(InputFile, OutputFile):
         >>> # output_file.create().write(b'foobytes')
     """
 
-    def __init__(self, location: str, path: str, fs: FileSystem):
+    @staticmethod
+    def normalize_path(location: str) -> str:

Review Comment:
   I've reverted most of the changes



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1013488618


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -93,11 +88,23 @@ class PyArrowFile(InputFile, OutputFile):
         >>> # output_file.create().write(b'foobytes')
     """
 
-    def __init__(self, location: str, path: str, fs: FileSystem):
+    @staticmethod
+    def normalize_path(location: str) -> str:

Review Comment:
   I don't think this class needs to change at all. The file system is already parsing the URI and can just provide the path like before.
   
   I think you could probably argue that this class could parse the path out of the `location` in `__init__` if the path isn't provided, but I think that's minor.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko closed pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
Fokko closed pull request #6010: Python: Fix PyArrowFileIO caching
URL: https://github.com/apache/iceberg/pull/6010


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6010: Python: Fix caching of the PyArrowFileIO

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1003109634


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -66,10 +74,14 @@ class PyArrowFile(InputFile, OutputFile):
         >>> # output_file.create().write(b'foobytes')
     """
 
-    def __init__(self, location: str, path: str, fs: FileSystem):
+    _normalized_location: NormalizedLocation
+
+    def __init__(self, location: Union[str, NormalizedLocation], fs: FileSystem):

Review Comment:
   Got it, I've reverted the bit and passed in the original location. I think PyArrow itself should be responsible for handling the URL. Otherwise, users need to construct the path when they directly initialize the PyArrow Input or Output file. Let me know what you think of the current approach



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6010: Python: Fix caching of the PyArrowFileIO

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1002159120


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -66,10 +74,14 @@ class PyArrowFile(InputFile, OutputFile):
         >>> # output_file.create().write(b'foobytes')
     """
 
-    def __init__(self, location: str, path: str, fs: FileSystem):
+    _normalized_location: NormalizedLocation
+
+    def __init__(self, location: Union[str, NormalizedLocation], fs: FileSystem):

Review Comment:
   This works, but I think it is probably cleaner to pass both `location` and `path` into `PyArrowFile` like it was before.
   
   The problem is that this needs to re-create the location that was passed in to create this, rather than just passing the path directly. It also makes a few changes in this class that are unnecessary if `PyArrowFileIO` passes location, path, and fs.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6010: Python: Fix caching of the PyArrowFileIO

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#issuecomment-1286290782

   > Also refactored the path/location structure since PyArrow requires URLs without a scheme.
   
   Can you explain this further? Why was it working before?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1013488063


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -223,8 +236,8 @@ def new_input(self, location: str) -> PyArrowFile:
         Returns:
             PyArrowFile: A PyArrowFile instance for the given location
         """
-        fs, path = self.get_fs_and_path(location)
-        return PyArrowFile(fs=fs, location=location, path=path)
+        fs = self.get_fs(self.extract_scheme(location))
+        return PyArrowFile(fs=fs, location=location)

Review Comment:
   I don't understand why this parses the URI here and in `PyArrowFile`. I think this should just leave PyArrowFile the way that it was and pass in the path after parsing it here:
   
   ```python
       scheme, path = parse_location(location)
       fs = self.get_fs(scheme)
       return PyArrowFile(fs=fs, location=location, path=path)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
rdblue merged PR #6010:
URL: https://github.com/apache/iceberg/pull/6010


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6010: Python: Fix caching of the PyArrowFileIO

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#issuecomment-1286621641

   Sorry for the limited context. I'm working on converting the files into a PyArrow Dataset. This requires passing in a single filesystem and a list of files. The files-paths can't have a scheme, since that will have PyArrow throw an error. The idea behind it is that the S3FileSystem already indicates that it is an S3 path.
   
   By splitting this we can re-use this logic to pass the list of files to the Dataset:
   
   ```python
   io = self.table.io()
   if isinstance(io, FsspecFileIO):
       ...
   elif isinstance(io, PyArrowFileIO):
       # We should not use internal methods
       fs = io._get_fs_and_path(files[0])[0]
       # This is also awkward, PyArrow requires removing the s3a://
       files = ["".join(urlparse(file)[1:3]) for file in files]
   else:
       raise ValueError(f"Unsupported FileSystem: {io}")
   ```
   
   Convert it into:
   ```python
   io = self.table.io()
   if isinstance(io, FsspecFileIO):
       ...
   elif isinstance(io, PyArrowFileIO):
       normalized_files = map(PyArrowFileIO.normalize_location, files)
       fs = io.get_fs(next(files).scheme)
       files = [file.path for file in normalized_files]
   else:
       raise ValueError(f"Unsupported FileSystem: {io}")
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#issuecomment-1304137072

   Looks great. Thanks, @Fokko!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6010: Python: Fix PyArrowFileIO caching

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6010:
URL: https://github.com/apache/iceberg/pull/6010#discussion_r1013633552


##########
python/pyiceberg/io/pyarrow.py:
##########
@@ -223,8 +236,8 @@ def new_input(self, location: str) -> PyArrowFile:
         Returns:
             PyArrowFile: A PyArrowFile instance for the given location
         """
-        fs, path = self.get_fs_and_path(location)
-        return PyArrowFile(fs=fs, location=location, path=path)
+        fs = self.get_fs(self.extract_scheme(location))
+        return PyArrowFile(fs=fs, location=location)

Review Comment:
   My main argument is that if you want to use the PyArrowFile on its own, then the user is responsible for setting the location and path, which I think the path should be kept internal to the object since it is a derivative of the location.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org