You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/12/09 00:33:55 UTC

[GitHub] [iceberg] cccs-eric opened a new pull request, #6392: Python: Add adlfs support (Azure DataLake FileSystem)

cccs-eric opened a new pull request, #6392:
URL: https://github.com/apache/iceberg/pull/6392

   Adds support for adlfs storage in Python API.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1044921399


##########
python/Makefile:
##########
@@ -26,14 +26,21 @@ lint:
 	poetry run pre-commit run --all-files
 
 test:
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3" ${PYTEST_ARGS}
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}

Review Comment:
   We're open to better ideas!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1044921558


##########
python/pyiceberg/io/__init__.py:
##########
@@ -257,6 +257,8 @@ def delete(self, location: Union[str, InputFile, OutputFile]) -> None:
     "gcs": [ARROW_FILE_IO],
     "file": [ARROW_FILE_IO],
     "hdfs": [ARROW_FILE_IO],
+    "abfs": [FSSPEC_FILE_IO],

Review Comment:
   Let's just include what is tested for now. We can check for pyarrow support and add it along with tests if it's there.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1045861171


##########
python/tests/io/test_fsspec.py:
##########
@@ -204,6 +204,191 @@ def test_writing_avro_file(generated_manifest_entry_file: Generator[str, None, N
             b2 = in_f.read()
             assert b1 == b2  # Check that bytes of read from local avro file match bytes written to s3
 
+    fsspec_fileio.delete(f"s3://warehouse/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_input_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new input file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+
+    assert isinstance(input_file, fsspec.FsspecInputFile)
+    assert input_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_abfss_output_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new output file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(f"abfss://tests/{filename}")
+
+    assert isinstance(output_file, fsspec.FsspecOutputFile)
+    assert output_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_write_and_read_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test writing and reading a file using FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+    assert input_file.open().read() == b"foo"
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_getting_length_of_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test getting the length of an FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foobar")
+
+    assert len(output_file) == 6
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    assert len(input_file) == 6
+
+    adlfs_fsspec_fileio.delete(output_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_file_tell_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test finding cursor position for an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foobar")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert f.tell() == 0
+    f.seek(1)
+    assert f.tell() == 1
+    f.seek(3)
+    assert f.tell() == 3
+    f.seek(0)
+    assert f.tell() == 0
+
+    adlfs_fsspec_fileio.delete(f"abfss://tests/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_read_specified_bytes_for_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test reading a specified number of bytes from an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert b"f" == f.read(1)
+    f.seek(0)
+    assert b"fo" == f.read(2)
+    f.seek(1)
+    assert b"o" == f.read(1)
+    f.seek(1)
+    assert b"oo" == f.read(2)
+    f.seek(0)
+    assert b"foo" == f.read(999)  # test reading amount larger than entire content length
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_raise_on_opening_file_not_found_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test that an fsspec input file raises appropriately when the adlfs file is not found"""
+
+    filename = str(uuid.uuid4())
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    with pytest.raises(FileNotFoundError):
+        input_file.open().read()
+
+    # filename is not propagated in FileNotFoundError for adlfs

Review Comment:
   Makes sense to me, having a consistent error feedback is valuable.  See changes in fsspec.py -> `FsspecInputFile.open()`.
   I might also open a PR upstream to [adlfs FsSpec ](https://github.com/fsspec/adlfs) to have them change those errors at the root.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1354040055

   @cccs-eric, I just merged $6439 that fixes the pyparsing problem (my fault) so you should be able to rebase and get tests working. Sorry about that!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1358021370

   Thanks, @cccs-eric!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1353168942

   @cccs-eric I forgot one thing, could you also add the `adlfs` option to the docs in `python/mkdocs/`?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1044922073


##########
python/dev/docker-compose-azurite.yml:
##########
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+version: "3"
+
+services:
+  azurite:
+    image: mcr.microsoft.com/azure-storage/azurite

Review Comment:
   I'd keep it separate, since we allow testing them separately in the Makefile.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043972849


##########
python/tests/io/test_fsspec.py:
##########
@@ -204,6 +204,191 @@ def test_writing_avro_file(generated_manifest_entry_file: Generator[str, None, N
             b2 = in_f.read()
             assert b1 == b2  # Check that bytes of read from local avro file match bytes written to s3
 
+    fsspec_fileio.delete(f"s3://warehouse/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_input_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new input file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+
+    assert isinstance(input_file, fsspec.FsspecInputFile)
+    assert input_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_abfss_output_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new output file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(f"abfss://tests/{filename}")
+
+    assert isinstance(output_file, fsspec.FsspecOutputFile)
+    assert output_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_write_and_read_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test writing and reading a file using FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+    assert input_file.open().read() == b"foo"
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_getting_length_of_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test getting the length of an FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foobar")
+
+    assert len(output_file) == 6
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    assert len(input_file) == 6
+
+    adlfs_fsspec_fileio.delete(output_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_file_tell_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test finding cursor position for an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foobar")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert f.tell() == 0
+    f.seek(1)
+    assert f.tell() == 1
+    f.seek(3)
+    assert f.tell() == 3
+    f.seek(0)
+    assert f.tell() == 0
+
+    adlfs_fsspec_fileio.delete(f"abfss://tests/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_read_specified_bytes_for_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test reading a specified number of bytes from an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert b"f" == f.read(1)
+    f.seek(0)
+    assert b"fo" == f.read(2)
+    f.seek(1)
+    assert b"o" == f.read(1)
+    f.seek(1)
+    assert b"oo" == f.read(2)
+    f.seek(0)
+    assert b"foo" == f.read(999)  # test reading amount larger than entire content length
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_raise_on_opening_file_not_found_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test that an fsspec input file raises appropriately when the adlfs file is not found"""
+
+    filename = str(uuid.uuid4())
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    with pytest.raises(FileNotFoundError):
+        input_file.open().read()
+
+    # filename is not propagated in FileNotFoundError for adlfs

Review Comment:
   All those test cases are "identical" to S3 except this one.  The raised error does not contain a filename, so the same assert cannot be done.  Speaking of identical tests, would it make sense to parameterize them?  Is there a way to make that work with PyTest mark?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1049565481


##########
python/Makefile:
##########
@@ -26,14 +26,21 @@ lint:
 	poetry run pre-commit run --all-files
 
 test:
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3" ${PYTEST_ARGS}
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}

Review Comment:
   Yeah, that sounds good.  I'll leave my code as-is and if my PR goes in before yours, you'll have to modify this line as a conflict resolution.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko merged pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko merged PR #6392:
URL: https://github.com/apache/iceberg/pull/6392


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1353167480

   @cccs-eric it looks like we have to add `pyparsing` to the `pyproject.toml`. I don't know why it was working before, but it should have been added in https://github.com/apache/iceberg/pull/6259


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043972849


##########
python/tests/io/test_fsspec.py:
##########
@@ -204,6 +204,191 @@ def test_writing_avro_file(generated_manifest_entry_file: Generator[str, None, N
             b2 = in_f.read()
             assert b1 == b2  # Check that bytes of read from local avro file match bytes written to s3
 
+    fsspec_fileio.delete(f"s3://warehouse/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_input_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new input file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+
+    assert isinstance(input_file, fsspec.FsspecInputFile)
+    assert input_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_abfss_output_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new output file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(f"abfss://tests/{filename}")
+
+    assert isinstance(output_file, fsspec.FsspecOutputFile)
+    assert output_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_write_and_read_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test writing and reading a file using FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+    assert input_file.open().read() == b"foo"
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_getting_length_of_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test getting the length of an FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foobar")
+
+    assert len(output_file) == 6
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    assert len(input_file) == 6
+
+    adlfs_fsspec_fileio.delete(output_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_file_tell_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test finding cursor position for an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foobar")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert f.tell() == 0
+    f.seek(1)
+    assert f.tell() == 1
+    f.seek(3)
+    assert f.tell() == 3
+    f.seek(0)
+    assert f.tell() == 0
+
+    adlfs_fsspec_fileio.delete(f"abfss://tests/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_read_specified_bytes_for_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test reading a specified number of bytes from an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert b"f" == f.read(1)
+    f.seek(0)
+    assert b"fo" == f.read(2)
+    f.seek(1)
+    assert b"o" == f.read(1)
+    f.seek(1)
+    assert b"oo" == f.read(2)
+    f.seek(0)
+    assert b"foo" == f.read(999)  # test reading amount larger than entire content length
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_raise_on_opening_file_not_found_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test that an fsspec input file raises appropriately when the adlfs file is not found"""
+
+    filename = str(uuid.uuid4())
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    with pytest.raises(FileNotFoundError):
+        input_file.open().read()
+
+    # filename is not propagated in FileNotFoundError for adlfs

Review Comment:
   All those test cases are "identical" to S3 except this one.  The raised error does not contain a filename, so the same assert cannot be done.  Speaking of identical tests, would it make sense to parameterize them?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043974117


##########
python/tests/conftest.py:
##########
@@ -1259,3 +1276,22 @@ def fixture_glue(_aws_credentials: None) -> Generator[boto3.client, None, None]:
     """Mocked glue client"""
     with mock_glue():
         yield boto3.client("glue", region_name="us-east-1")
+
+
+@pytest.fixture
+def adlfs_fsspec_fileio(request: pytest.FixtureRequest) -> Generator[FsspecFileIO, None, None]:
+    from azure.storage.blob import BlobServiceClient

Review Comment:
   mypy is complaining about this import, not sure how to make it work since adlfs is an optional dependency.  



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1045861171


##########
python/tests/io/test_fsspec.py:
##########
@@ -204,6 +204,191 @@ def test_writing_avro_file(generated_manifest_entry_file: Generator[str, None, N
             b2 = in_f.read()
             assert b1 == b2  # Check that bytes of read from local avro file match bytes written to s3
 
+    fsspec_fileio.delete(f"s3://warehouse/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_input_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new input file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+
+    assert isinstance(input_file, fsspec.FsspecInputFile)
+    assert input_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_abfss_output_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new output file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(f"abfss://tests/{filename}")
+
+    assert isinstance(output_file, fsspec.FsspecOutputFile)
+    assert output_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_write_and_read_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test writing and reading a file using FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+    assert input_file.open().read() == b"foo"
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_getting_length_of_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test getting the length of an FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foobar")
+
+    assert len(output_file) == 6
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    assert len(input_file) == 6
+
+    adlfs_fsspec_fileio.delete(output_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_file_tell_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test finding cursor position for an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foobar")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert f.tell() == 0
+    f.seek(1)
+    assert f.tell() == 1
+    f.seek(3)
+    assert f.tell() == 3
+    f.seek(0)
+    assert f.tell() == 0
+
+    adlfs_fsspec_fileio.delete(f"abfss://tests/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_read_specified_bytes_for_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test reading a specified number of bytes from an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert b"f" == f.read(1)
+    f.seek(0)
+    assert b"fo" == f.read(2)
+    f.seek(1)
+    assert b"o" == f.read(1)
+    f.seek(1)
+    assert b"oo" == f.read(2)
+    f.seek(0)
+    assert b"foo" == f.read(999)  # test reading amount larger than entire content length
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_raise_on_opening_file_not_found_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test that an fsspec input file raises appropriately when the adlfs file is not found"""
+
+    filename = str(uuid.uuid4())
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    with pytest.raises(FileNotFoundError):
+        input_file.open().read()
+
+    # filename is not propagated in FileNotFoundError for adlfs

Review Comment:
   Makes sense to me, having a consistent error feedback is valuable.  See changes in fsspec.py -> `FsspecInputFile.open()`.
   I might also open a PR upstream to [FsSpec ](https://github.com/fsspec/filesystem_spec)to have them change those errors at the root.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1356970169

   @rdblue All clear now


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1049285936


##########
python/Makefile:
##########
@@ -26,14 +26,21 @@ lint:
 	poetry run pre-commit run --all-files
 
 test:
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3" ${PYTEST_ARGS}
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}

Review Comment:
   I've been looking at this in https://github.com/apache/iceberg/pull/6398/ as well. I've added:
   
   ```python
   def pytest_collection_modifyitems(items, config):
       for item in items:
           if not any(item.iter_markers()):
               item.add_marker("unmarked")
   ```
   
   Which will automatically add the mark `unmarked` to each of the tests that doesn't have a marked, creating their own group. WDYT of that?



##########
python/tests/conftest.py:
##########
@@ -76,13 +76,30 @@
 
 
 def pytest_addoption(parser: pytest.Parser) -> None:
+    # S3 options
     parser.addoption(
         "--s3.endpoint", action="store", default="http://localhost:9000", help="The S3 endpoint URL for tests marked as s3"
     )
     parser.addoption("--s3.access-key-id", action="store", default="admin", help="The AWS access key ID for tests marked as s3")
     parser.addoption(
         "--s3.secret-access-key", action="store", default="password", help="The AWS secret access key ID for tests marked as s3"
     )
+    # ADLFS options
+    parser.addoption(
+        "--adlfs.endpoint",
+        action="store",
+        default="http://127.0.0.1:10000",
+        help="The ADLS endpoint URL for tests marked as adlfs",
+    )
+    parser.addoption(
+        "--adlfs.account-name", action="store", default="devstoreaccount1", help="The ADLS account key for tests marked as adlfs"
+    )
+    parser.addoption(
+        "--adlfs.account-key",
+        action="store",
+        default="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",

Review Comment:
   Maybe we should add this Github comment as a code comment, in case we start doing credential scanning one day.



##########
python/pyiceberg/io/fsspec.py:
##########
@@ -143,8 +153,15 @@ def open(self) -> InputStream:
 
         Returns:
             OpenFile: An fsspec compliant file-like object
+
+        Raises:
+            FileNotFoundError: If the file does not exist
         """
-        return self._fs.open(self.location, "rb")
+        try:
+            return self._fs.open(self.location, "rb")
+        except FileNotFoundError as e:
+            # To have a consistent error handling experience, make sure exception contains missing file location.

Review Comment:
   This is perfect! We don't want to return implementation-specific errors



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1354060350

   @Fokko One last thing that I haven't done is modifying verify-release.md for the integration tests.  Should I add `test-adlfs` in there?
   https://github.com/apache/iceberg/blob/master/python/mkdocs/docs/verify-release.md?plain=1#L86-L90


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043970905


##########
python/dev/docker-compose-azurite.yml:
##########
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+version: "3"
+
+services:
+  azurite:
+    image: mcr.microsoft.com/azure-storage/azurite

Review Comment:
   I named that file `docker-compose-azurite.yml`, wasn't sure if I should rename the other one to `docker-compose-s3.yml` or if both should be combined...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043970905


##########
python/dev/docker-compose-azurite.yml:
##########
@@ -0,0 +1,26 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+version: "3"
+
+services:
+  azurite:
+    image: mcr.microsoft.com/azure-storage/azurite

Review Comment:
   I named that file `docker-compose-azurite.yml`, wasn't sure if I should rename the other one for `-s3.yml` or if both should be combined...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043971511


##########
python/pyiceberg/io/__init__.py:
##########
@@ -257,6 +257,8 @@ def delete(self, location: Union[str, InputFile, OutputFile]) -> None:
     "gcs": [ARROW_FILE_IO],
     "file": [ARROW_FILE_IO],
     "hdfs": [ARROW_FILE_IO],
+    "abfs": [FSSPEC_FILE_IO],

Review Comment:
   Didn't know if PyArrow would also apply here.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1044921819


##########
python/tests/conftest.py:
##########
@@ -1259,3 +1276,22 @@ def fixture_glue(_aws_credentials: None) -> Generator[boto3.client, None, None]:
     """Mocked glue client"""
     with mock_glue():
         yield boto3.client("glue", region_name="us-east-1")
+
+
+@pytest.fixture
+def adlfs_fsspec_fileio(request: pytest.FixtureRequest) -> Generator[FsspecFileIO, None, None]:
+    from azure.storage.blob import BlobServiceClient

Review Comment:
   We add excludes to the pyproject.toml file. You'll see a huge list.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1044923034


##########
python/tests/io/test_fsspec.py:
##########
@@ -204,6 +204,191 @@ def test_writing_avro_file(generated_manifest_entry_file: Generator[str, None, N
             b2 = in_f.read()
             assert b1 == b2  # Check that bytes of read from local avro file match bytes written to s3
 
+    fsspec_fileio.delete(f"s3://warehouse/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_input_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new input file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+
+    assert isinstance(input_file, fsspec.FsspecInputFile)
+    assert input_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_new_abfss_output_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test creating a new output file from an fsspec file-io"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(f"abfss://tests/{filename}")
+
+    assert isinstance(output_file, fsspec.FsspecOutputFile)
+    assert output_file.location == f"abfss://tests/{filename}"
+
+
+@pytest.mark.adlfs
+def test_fsspec_write_and_read_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test writing and reading a file using FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(f"abfss://tests/{filename}")
+    assert input_file.open().read() == b"foo"
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_getting_length_of_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test getting the length of an FsspecInputFile and FsspecOutputFile"""
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as f:
+        f.write(b"foobar")
+
+    assert len(output_file) == 6
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    assert len(input_file) == 6
+
+    adlfs_fsspec_fileio.delete(output_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_file_tell_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test finding cursor position for an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foobar")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert f.tell() == 0
+    f.seek(1)
+    assert f.tell() == 1
+    f.seek(3)
+    assert f.tell() == 3
+    f.seek(0)
+    assert f.tell() == 0
+
+    adlfs_fsspec_fileio.delete(f"abfss://tests/{filename}")
+
+
+@pytest.mark.adlfs
+def test_fsspec_read_specified_bytes_for_file_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test reading a specified number of bytes from an fsspec file-io file"""
+
+    filename = str(uuid.uuid4())
+    output_file = adlfs_fsspec_fileio.new_output(location=f"abfss://tests/{filename}")
+    with output_file.create() as write_file:
+        write_file.write(b"foo")
+
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    f = input_file.open()
+
+    f.seek(0)
+    assert b"f" == f.read(1)
+    f.seek(0)
+    assert b"fo" == f.read(2)
+    f.seek(1)
+    assert b"o" == f.read(1)
+    f.seek(1)
+    assert b"oo" == f.read(2)
+    f.seek(0)
+    assert b"foo" == f.read(999)  # test reading amount larger than entire content length
+
+    adlfs_fsspec_fileio.delete(input_file)
+
+
+@pytest.mark.adlfs
+def test_fsspec_raise_on_opening_file_not_found_adlfs(adlfs_fsspec_fileio: FsspecFileIO) -> None:
+    """Test that an fsspec input file raises appropriately when the adlfs file is not found"""
+
+    filename = str(uuid.uuid4())
+    input_file = adlfs_fsspec_fileio.new_input(location=f"abfss://tests/{filename}")
+    with pytest.raises(FileNotFoundError):
+        input_file.open().read()
+
+    # filename is not propagated in FileNotFoundError for adlfs

Review Comment:
   Can we add the filename in the `open` or `read` call by catching `FileNotFoundError` and wrapping it? That seems worth an check in exception handling:
   
   ```python
   def open(self):
       try:
           # original method contents
       except FileNotFoundError as e:
           if str(e) contains self.location:
               raise e
           else:
               raise FileNotFoundError(f"No such file: {self.location}") from e
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1353199075

   @cccs-eric Yes, please do!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1353755124

   > @cccs-eric it looks like we have to add `pyparsing` to the `pyproject.toml`. I don't know why it was working before, but it should have been added in #6259
   
   @Fokko I don't understand what is going on.  `pyparsing` is part of `pyproject.toml` and when I run `make test-s3`, *it works on my machine* ™️ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1354058385

   > @cccs-eric, I just merged $6439 that fixes the pyparsing problem (my fault) so you should be able to rebase and get tests working. Sorry about that!
   
   Thanks @rdblue , build is now passing 🥳 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
rdblue commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1356903295

   @cccs-eric, looks like this was out of date. I tried to fix conflicts, but it is failing tests. Can you fix and then we'll merge?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043971934


##########
python/tests/conftest.py:
##########
@@ -76,13 +76,30 @@
 
 
 def pytest_addoption(parser: pytest.Parser) -> None:
+    # S3 options
     parser.addoption(
         "--s3.endpoint", action="store", default="http://localhost:9000", help="The S3 endpoint URL for tests marked as s3"
     )
     parser.addoption("--s3.access-key-id", action="store", default="admin", help="The AWS access key ID for tests marked as s3")
     parser.addoption(
         "--s3.secret-access-key", action="store", default="password", help="The AWS secret access key ID for tests marked as s3"
     )
+    # ADLFS options
+    parser.addoption(
+        "--adlfs.endpoint",
+        action="store",
+        default="http://127.0.0.1:10000",
+        help="The ADLS endpoint URL for tests marked as adlfs",
+    )
+    parser.addoption(
+        "--adlfs.account-name", action="store", default="devstoreaccount1", help="The ADLS account key for tests marked as adlfs"

Review Comment:
   That account name is the one by default with azurite.



##########
python/tests/conftest.py:
##########
@@ -76,13 +76,30 @@
 
 
 def pytest_addoption(parser: pytest.Parser) -> None:
+    # S3 options
     parser.addoption(
         "--s3.endpoint", action="store", default="http://localhost:9000", help="The S3 endpoint URL for tests marked as s3"
     )
     parser.addoption("--s3.access-key-id", action="store", default="admin", help="The AWS access key ID for tests marked as s3")
     parser.addoption(
         "--s3.secret-access-key", action="store", default="password", help="The AWS secret access key ID for tests marked as s3"
     )
+    # ADLFS options
+    parser.addoption(
+        "--adlfs.endpoint",
+        action="store",
+        default="http://127.0.0.1:10000",
+        help="The ADLS endpoint URL for tests marked as adlfs",
+    )
+    parser.addoption(
+        "--adlfs.account-name", action="store", default="devstoreaccount1", help="The ADLS account key for tests marked as adlfs"
+    )
+    parser.addoption(
+        "--adlfs.account-key",
+        action="store",
+        default="Eby8vdM02xNOcqFlqUwJPLlmEtlCDXJ1OUzFT50uSRZ6IFsuFq2UVErCz4I6tq/K1SZFPTOtr/KBHBeksoGMGw==",

Review Comment:
   That account key is the one by default with azurite.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] cccs-eric commented on a diff in pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
cccs-eric commented on code in PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#discussion_r1043969300


##########
python/Makefile:
##########
@@ -26,14 +26,21 @@ lint:
 	poetry run pre-commit run --all-files
 
 test:
-	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3" ${PYTEST_ARGS}
+	poetry run coverage run --source=pyiceberg/ -m pytest tests/ -m "not s3 and not adlfs" ${PYTEST_ARGS}

Review Comment:
   The "not X and not Y" does not really scale well, could we use another scheme to enable/disable tests?  Comment also applies further down to `test-s3` and `test-adlfs`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on pull request #6392: Python: Add adlfs support (Azure DataLake FileSystem)

Posted by GitBox <gi...@apache.org>.
Fokko commented on PR #6392:
URL: https://github.com/apache/iceberg/pull/6392#issuecomment-1357584071

   Awesome, thanks @cccs-eric for working on this 🚀 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org