Posted to dev@avro.apache.org by "Michael Coon (JIRA)" <ji...@apache.org> on 2016/09/15 14:46:20 UTC

[jira] [Created] (AVRO-1917) DataFileStream Skips Blocks with hasNext and nextBlock calls

Michael Coon created AVRO-1917:
----------------------------------

             Summary: DataFileStream Skips Blocks with hasNext and nextBlock calls
                 Key: AVRO-1917
                 URL: https://issues.apache.org/jira/browse/AVRO-1917
             Project: Avro
          Issue Type: Bug
          Components: java
            Reporter: Michael Coon


We have a situation where there are potentially large segments of data embedded in an Avro data item. Sometimes an upstream system becomes corrupted and adds hundreds of thousands of array items to the structure. When I try to read such an item as a datum record, it immediately exhausts the heap.

To catch this situation, I created a custom DatumReader that checks the size of arrays and byte[] values and, when a threshold is exceeded, throws a custom exception; I catch that exception and skip the corrupted item in the file. To make this try-catch-skip approach work, I had to call hasNext and then nextBlock to get the raw ByteBuffer to send to my reader.

Unfortunately, calling "hasNext" and then "nextBlock" actually skips the first block in the underlying data stream, because "nextBlock" itself calls "hasNext", which reads the next block. So my call to hasNext read one block, and nextBlock's internal call read the following one, silently dropping bytes. My workaround is a do...while loop that catches "NoSuchElementException" instead of calling hasNext first, but this is not intuitive and required reviewing the source to discover. The fix is to make hasNext and nextBlock agree on shared state so that hasNext does not advance past a block that nextBlock has not yet returned.
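For illustration, here is a minimal sketch of the kind of size guard I'm describing, written as a GenericDatumReader subclass. The class name, threshold, and exception message are hypothetical, not part of Avro; the newArray hook sees the decoded item count for each array block before any items are allocated, and a byte[] limit could be enforced with a similar override:

    import org.apache.avro.AvroRuntimeException;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericDatumReader;

    // Hypothetical guard reader: rejects oversized array blocks before
    // they are materialized on the heap. Name and threshold are illustrative.
    public class BoundedDatumReader<D> extends GenericDatumReader<D> {
      private static final int MAX_ARRAY_ITEMS = 100000; // illustrative limit

      public BoundedDatumReader(Schema schema) {
        super(schema);
      }

      @Override
      protected Object newArray(Object old, int size, Schema schema) {
        // 'size' is the decoded item count for the next array block; a
        // corrupt count shows up here before any items are allocated.
        if (size > MAX_ARRAY_ITEMS) {
          throw new AvroRuntimeException(
              "array block of " + size + " items exceeds threshold");
        }
        return super.newArray(old, size, schema);
      }
    }

And a sketch of the workaround loop, relying on nextBlock throwing NoSuchElementException at end of stream instead of calling hasNext first (the process method is a stand-in for handing the raw block to the guarded reader):

    import java.io.IOException;
    import java.io.InputStream;
    import java.nio.ByteBuffer;
    import java.util.NoSuchElementException;

    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;

    public class BlockSkipWorkaround {
      static void readAllBlocks(InputStream in) throws IOException {
        DataFileStream<GenericRecord> stream =
            new DataFileStream<GenericRecord>(in,
                new GenericDatumReader<GenericRecord>());
        try {
          do {
            ByteBuffer block;
            try {
              // Do NOT call hasNext() first: nextBlock() advances the
              // stream itself, so calling both consumes two blocks.
              block = stream.nextBlock();
            } catch (NoSuchElementException end) {
              break; // end of stream
            }
            process(block); // hand the raw block to the guarded reader
          } while (true);
        } finally {
          stream.close();
        }
      }

      private static void process(ByteBuffer block) {
        // decode here with the guarded reader, catching its threshold
        // exception to skip a corrupted block
      }
    }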



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)