Posted to user@avro.apache.org by David Beswick <Da...@bupa.com.au> on 2019/06/19 03:02:52 UTC

Python 3: AssertionError on reading empty file

Hello,

I'm getting this problem with the PIP package avro-python3-1.9.0.

The package seems to have an issue with raw codec files that contain no records (just a '0' block count) but then follow the empty block with a sync marker. I've attached an example file, but I'm not sure if it'll come through; let me know if you'd like it. It was written by a process external to us.
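
In case the attachment does not come through, the file shape described above can be sketched with the standard library alone. This is a minimal sketch, assuming the Object Container File layout from the Avro spec (magic bytes, metadata map, 16-byte sync marker, then data blocks) and the "null" codec. The helper names, the schema, and the sync marker value are hypothetical and are not taken from 28.avro.

```python
import json


def encode_long(n: int) -> bytes:
    """Encode an integer as an Avro long: zigzag, then base-128 varint."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_bytes(b: bytes) -> bytes:
    """Encode a length-prefixed byte string (also used for map keys)."""
    return encode_long(len(b)) + b


def empty_block_file(schema: dict, sync: bytes) -> bytes:
    """Build a container file whose only data block holds zero records."""
    assert len(sync) == 16
    meta = {
        "avro.schema": json.dumps(schema).encode("utf-8"),
        "avro.codec": b"null",  # assumption: uncompressed blocks
    }
    out = bytearray(b"Obj\x01")          # magic
    out += encode_long(len(meta))        # one map block with len(meta) entries
    for key, value in meta.items():
        out += encode_bytes(key.encode("utf-8")) + encode_bytes(value)
    out += encode_long(0)                # end of metadata map
    out += sync                          # header sync marker
    # Data block: object count 0, byte size 0, then the sync marker again.
    out += encode_long(0) + encode_long(0) + sync
    return bytes(out)


# Hypothetical schema and sync marker, for illustration only.
SCHEMA = {"type": "record", "name": "Example",
          "fields": [{"name": "name", "type": "string"}]}
SYNC = bytes(range(16, 32))
data = empty_block_file(SCHEMA, SYNC)
```

Writing `data` to a file and opening it with DataFileReader should exercise the same code path, though I have not confirmed that it matches the original file byte for byte.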

The "avro-tools" package reads these kinds of files fine.

The problem files generate this traceback and assertion. Example code and traceback:


from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader

with DataFileReader(open("28.avro", 'rb'), DatumReader()) as r:
    print(r.meta)
    for rec in r:
        print(rec)


Traceback (most recent call last):
  File "./test.py", line 31, in <module>
    for rec in r:
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/datafile.py", line 526, in __next__
    datum = self.datum_reader.read(self.datum_decoder)
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 489, in read
    return self.read_data(self.writer_schema, self.reader_schema, decoder)
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 534, in read_data
    return self.read_record(writer_schema, reader_schema, decoder)
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 734, in read_record
    field_val = self.read_data(field.type, readers_field.type, decoder)
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 512, in read_data
    return decoder.read_utf8()
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 257, in read_utf8
    input_bytes = self.read_bytes()
  File "/home/dbeswick/.local/lib/python3.6/site-packages/avro/io.py", line 249, in read_bytes
    assert (nbytes >= 0), nbytes
AssertionError: -11
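
A negative length like this is what one would expect if bytes that are not record data (for example, part of a sync marker) get decoded as a string length. The sketch below decodes a single Avro long (base-128 varint, then zigzag) to show how one byte can yield -11. The function name is hypothetical, and the input byte 0x15 is an assumption chosen for illustration; the bytes actually read from 28.avro are not known.

```python
import io


def read_long(stream) -> int:
    """Decode one Avro long (base-128 varint, then zigzag) from a byte stream."""
    shift = 0
    accum = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("stream ended inside a varint")
        b = chunk[0]
        accum |= (b & 0x7F) << shift
        if not (b & 0x80):  # high bit clear: last byte of the varint
            break
        shift += 7
    # Undo zigzag: even values are non-negative, odd values are negative.
    return (accum >> 1) ^ -(accum & 1)


# Hypothetical byte; any odd varint value decodes to a negative number.
length = read_long(io.BytesIO(b"\x15"))
print(length)  # -11
```

Since every odd varint value maps to a negative number, roughly half of arbitrary bytes would trip the same assertion when misread as a length.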


I think the issue is in the __next__ function of DataFileReader, which seems to assume that a datum will always follow a block header read. The following implementation fixes the bug for me. Is it correct?

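  def __next__(self):
    """Return the next datum in the file."""
    # Loop rather than read once, so a block with a zero count is skipped
    # instead of being followed by an attempt to decode a datum.
    while True:
        if self.block_count == 0:
            if self.is_EOF():
                raise StopIteration
            elif self._skip_sync():
                # Sync marker consumed; re-check for EOF or the next block.
                pass
            else:
                self._read_block_header()
        else:
            datum = self.datum_reader.read(self.datum_decoder)
            self._block_count -= 1
            return datum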
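
The control flow of that loop can be tried out in isolation with a toy model. This is only a sketch of the idea, not the avro library code: `ToyReader` is a hypothetical class in which each block is a plain Python list, running out of blocks stands in for is_EOF(), and taking the next list stands in for _read_block_header().

```python
class ToyReader:
    """Toy stand-in for a block-structured reader; not the avro class."""

    def __init__(self, blocks):
        self._blocks = [list(b) for b in blocks]  # blocks still unread
        self._current = []                        # records left in this block

    def __iter__(self):
        return self

    def __next__(self):
        # Keep advancing until a block with records is found, so empty
        # blocks never lead to an attempt to return a record.
        while True:
            if not self._current:
                if not self._blocks:
                    raise StopIteration
                self._current = self._blocks.pop(0)
            else:
                return self._current.pop(0)


print(list(ToyReader([[], ["a", "b"], [], ["c"], []])))  # ['a', 'b', 'c']
```

Empty blocks at the start, middle, and end are all skipped, which is the behaviour the proposed __next__ is aiming for.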


Please also note that it seems two __next__ methods have been mistakenly put in this class.

Regards,
David

Bupa A&NZ email disclaimer: The information contained in this email and any attachments is confidential and may be subject to copyright or other intellectual property protection. If you are not the intended recipient, you are not authorized to use or disclose this information, and we request that you notify us by reply mail or telephone and delete the original message from your mail system.

RE: Python 3: AssertionError on reading empty file

Posted by David Beswick <Da...@bupa.com.au>.
Thanks for your reply, Michael. I’ve created https://issues.apache.org/jira/browse/AVRO-2432 and will attach a patch.


From: Michael A. Smith <mi...@smith-li.com> 
Sent: Wednesday, 19 June 2019 8:50 PM
To: user@avro.apache.org
Subject: Re: Python 3: AssertionError on reading empty file

Interesting question. The spec says a file has to start with a header. 

http://avro.apache.org/docs/current/spec.html#Object+Container+Files

However, it may still be appropriate to have consistent behavior with the tools/java implementation. We could discuss amending the spec to be clearer about this case either way on the dev list.

Also the duplicate __next__ is surely a mistake. Would you please open a ticket and consider making a pull request as well?

https://avro.apache.org/issue_tracking.html



Re: Python 3: AssertionError on reading empty file

Posted by "Michael A. Smith" <mi...@smith-li.com>.
Interesting question. The spec says a file has to start with a header.

http://avro.apache.org/docs/current/spec.html#Object+Container+Files

However, it may still be appropriate to have consistent behavior with the
tools/java implementation. We could discuss amending the spec to be clearer
about this case either way on the dev list.

Also the duplicate __next__ is surely a mistake. Would you please open a
ticket and consider making a pull request as well?

https://avro.apache.org/issue_tracking.html

