You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2023/01/06 11:00:06 UTC

[GitHub] [iceberg] Fokko opened a new pull request, #6532: Python: Fix reading 0-bytes binary

Fokko opened a new pull request, #6532:
URL: https://github.com/apache/iceberg/pull/6532

   When a binary field would be zero bytes, we would still read the stream, causing a EOFErrors:
   
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/manifest.py", line 139, in read_manifest_entry
       for record in reader:
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/file.py", line 194, in __next__
       return self.__next__()
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/file.py", line 186, in __next__
       return next(self.block)
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/file.py", line 113, in __next__
       return self.reader.read(self.block_decoder)
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/reader.py", line 260, in read
       struct.set(pos, field.read(decoder))  # later: pass reuse in here
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/reader.py", line 260, in read
       struct.set(pos, field.read(decoder))  # later: pass reuse in here
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/reader.py", line 233, in read
       return self.option.read(decoder)
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/reader.py", line 313, in read
       read_items[key] = self.value.read(decoder)
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/reader.py", line 200, in read
       return decoder.read_bytes()
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/decoder.py", line 118, in read_bytes
       return self.read(self.read_int())
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/decoder.py", line 54, in read
       raise EOFError
   EOFError
   ```
   
   Fixes #6435
   
   Many thanks to @joshuarobinson for reporting this 🙏🏻 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063848387


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -115,7 +115,8 @@ def read_bytes(self) -> bytes:
         """
         Bytes are encoded as a long followed by that many bytes of data.
         """
-        return self.read(self.read_int())
+        num_bytes = self.read_int()
+        return self.read(num_bytes) if num_bytes > 0 else b""

Review Comment:
   Ah, this makes more sense! So we were requesting 0 bytes and hitting an `EOFError` because of it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063849334


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   Just to capture this somewhere, Python does use [0 bytes to signal EOF](https://docs.python.org/3/library/io.html#io.RawIOBase.read):
   
   > If 0 bytes are returned, and size was not 0, this indicates end of file.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063851662


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")
         elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
+            raise EOFError(f"Read {read_len} bytes, expected {n} bytes")

Review Comment:
   I don't think this is quite correct. If read_len is 0 then `EOFError` is correct. But it is perfectly fine for a read to return fewer bytes than requested. From [RawIOBase.read docs](https://docs.python.org/3/library/io.html#io.RawIOBase.read):
   
   > Read up to size bytes from the object and return them. As a convenience, if size is unspecified or -1, all bytes until EOF are returned. Otherwise, only one system call is ever made. Fewer than size bytes may be returned if the operating system call returns fewer than size bytes.
   
   If fewer bytes are returned this can either loop and call read until the required bytes are returned ("read fully"), or it can return a shorter array. Since this is a higher-level API, I think it should have read fully behavior.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063713132


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   I'm not too opinionated about this, but it looks like a `ValueError` is not the most appropriate one:
   <img width="827" alt="image" src="https://user-images.githubusercontent.com/1134248/211082912-d503f66a-7bcd-45c1-858e-3377b0e513b3.png">



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063341197


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")
         elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
+            raise EOFError(f"Read {read_len} bytes, expected {n} bytes")

Review Comment:
   I had to change this to an EOF error. Since that's the error that we catch after finish reading a block:
   ```
   Traceback (most recent call last):
     File "<stdin>", line 1, in <module>
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/manifest.py", line 139, in read_manifest_entry
       for record in reader:
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/file.py", line 189, in __next__
       new_block = self._read_block()
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/file.py", line 172, in _read_block
       block_records = self.decoder.read_int()
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/decoder.py", line 71, in read_int
       b = ord(self.read(1))
     File "/Users/fokkodriesprong/Desktop/iceberg/python/pyiceberg/avro/decoder.py", line 56, in read
       raise ValueError(f"Read {read_len} bytes, expected {n} bytes")
   ValueError: Read 0 bytes, expected 1 bytes
   ```



##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   I think this is more accurate, since zero is not negative



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064187523


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   This isn't necessarily negative. Most likely, it is going to be 0, so this would look strange. How about `f"EOF: read {read_len} bytes"`



##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   This isn't necessarily negative. Most likely, it is going to be 0, so this would look strange. How about `f"EOF: read {read_len} bytes"`?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064187585


##########
python/tests/avro/test_decoder.py:
##########
@@ -109,7 +109,7 @@ class OneByteAtATimeInputStream(InputStream):
 
     def read(self, size: int = 0) -> bytes:
         self.pos += 1
-        return int.to_bytes(1, self.pos, byteorder="little")
+        return self.pos.to_bytes(1, byteorder="little")

Review Comment:
   Why was this needed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko merged pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko merged PR #6532:
URL: https://github.com/apache/iceberg/pull/6532


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063849821


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -115,7 +115,8 @@ def read_bytes(self) -> bytes:
         """
         Bytes are encoded as a long followed by that many bytes of data.
         """
-        return self.read(self.read_int())
+        num_bytes = self.read_int()
+        return self.read(num_bytes) if num_bytes > 0 else b""

Review Comment:
   It would also make sense to put this behavior into `read(num_bytes)` since we could have other callers ask for 0 bytes by accident.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064191436


##########
python/tests/avro/test_decoder.py:
##########
@@ -109,7 +109,7 @@ class OneByteAtATimeInputStream(InputStream):
 
     def read(self, size: int = 0) -> bytes:
         self.pos += 1
-        return int.to_bytes(1, self.pos, byteorder="little")
+        return self.pos.to_bytes(1, byteorder="little")

Review Comment:
   The test was actually incorrect. The `self.pos` is at the position of length, and the name (`OneByteAtATimeInputStream`) suggests that it only returns one byte at a time.
   
   Without the change, the test fails:
   ```
       def test_read_single_byte_at_the_time() -> None:
           decoder = BinaryDecoder(OneByteAtATimeInputStream())
   >       assert decoder.read(2) == b"\x01\x02"
   E       AssertionError: assert bytearray(b'\x01\x01\x00') == b'\x01\x02'
   E         At index 1 diff: 1 != 2
   E         Left contains one more item: 0
   E         Use -v to get more diff
   ```
   It will return more bytes each time it is called:
   
   <img width="586" alt="image" src="https://user-images.githubusercontent.com/1134248/211215078-ad04d82f-116f-4d29-a540-545860b4013f.png">
   
   <img width="586" alt="image" src="https://user-images.githubusercontent.com/1134248/211215085-d52e6937-9d68-4c8d-a2fe-ff4ed5b5686a.png">
   
   <img width="586" alt="image" src="https://user-images.githubusercontent.com/1134248/211215090-0d3bb081-a781-4910-8df4-77f253b7f2d5.png">
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064187384


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")
+            data += data_read

Review Comment:
   Looks like this will reallocate and copy, according to [this post on efficient byte operations](https://www.guyrutenberg.com/2020/04/04/fast-bytes-concatenation-in-python/). From the post, it looks like we should either keep a list of read buffers and call join at the end, or use `bytearray` to accumulate data and convert at the end.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064193045


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")
+            data += data_read

Review Comment:
   I went with the list approach. It is slightly slower than the bytearray approach, but in our case we'll have to convert the `bytearray` back to bytes which also introduces a penalty:
   
   ```
   ```



##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")
+            data += data_read

Review Comment:
   I went with the list approach. It is slightly slower than the `bytearray` approach, but in our case, we'll have to convert the `bytearray` back to bytes which also introduces a penalty:
   
   ```
   ➜  python git:(fd-fix-eof) ✗ python3
   Python 3.10.9 (main, Dec 15 2022, 17:11:09) [Clang 14.0.0 (clang-1400.0.29.202)] on darwin
   Type "help", "copyright", "credits" or "license" for more information.
   >>> h1 = """
   ... ret = bytearray()
   ... for i in range(2**10):
   ...     ret += b'a' * 2**10
   ... ret
   ... """
   >>> 
   >>> h2 = """
   ... ret = bytearray()
   ... for i in range(2**10):
   ...     ret += b'a' * 2**10
   ... bytes(ret)
   ... """
   >>> 
   >>> g = """
   ... ret = list()
   ... for i in range(2**10):
   ...     ret.append(b'a' * 2**10)
   ... b''.join(ret)
   ... """
   >>> 
   >>> import timeit
   >>> print(timeit.timeit(stmt=h1, number=100000))
   4.586242791032419
   >>> print(timeit.timeit(stmt=h2, number=100000))
   6.316140374983661
   >>> print(timeit.timeit(stmt=g, number=100000))
   4.913256458006799
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] Fokko commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
Fokko commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1064190334


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -48,13 +48,18 @@ def read(self, n: int) -> bytes:
         """
         if n < 0:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
-        read_bytes = self._input_stream.read(n)
-        read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
-        elif read_len != n:
-            raise ValueError(f"Read {len(read_bytes)} bytes, expected {n} bytes")
-        return read_bytes
+        data = b""
+
+        n_remaining = n
+        while n_remaining > 0:
+            data_read = self._input_stream.read(n_remaining)
+            read_len = len(data_read)
+            if read_len <= 0:
+                raise EOFError(f"Got negative length: {read_len}")
+            data += data_read

Review Comment:
   A good call!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a diff in pull request #6532: Python: Fix reading 0-bytes binary

Posted by GitBox <gi...@apache.org>.
rdblue commented on code in PR #6532:
URL: https://github.com/apache/iceberg/pull/6532#discussion_r1063624361


##########
python/pyiceberg/avro/decoder.py:
##########
@@ -50,10 +50,10 @@ def read(self, n: int) -> bytes:
             raise ValueError(f"Requested {n} bytes to read, expected positive integer.")
         read_bytes = self._input_stream.read(n)
         read_len = len(read_bytes)
-        if read_len <= 0:
-            raise EOFError
+        if read_len < 0:
+            raise EOFError(f"Got negative length: {read_len}")

Review Comment:
   If I remember correctly, I think that this is how Python streams signal EOF. Now that we're using an Arrow stream, maybe that has changed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org