Posted to dev@parquet.apache.org by "Gabor Szadovszky (Jira)" <ji...@apache.org> on 2020/12/07 11:09:00 UTC

[jira] [Resolved] (PARQUET-1947) DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data

     [ https://issues.apache.org/jira/browse/PARQUET-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gabor Szadovszky resolved PARQUET-1947.
---------------------------------------
    Resolution: Fixed

> DeprecatedParquetInputFormat in CombineFileInputFormat would produce wrong data
> -------------------------------------------------------------------------------
>
>                 Key: PARQUET-1947
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1947
>             Project: Parquet
>          Issue Type: Bug
>          Components: parquet-cascading
>            Reporter: Daniel Dai
>            Assignee: Daniel Dai
>            Priority: Major
>         Attachments: Part1.java
>
>
> When reading Parquet files with Cascading 2, we observe wrong data at file boundaries when input combining is turned on in Cascading (setUseCombinedInput set to true).
> This can be reproduced easily with two Parquet input files, each containing one record. A simple Cascading application (attached) reads the two inputs with setUseCombinedInput(true). The output contains the record from the first input file twice, and the record from the second input file is missing.
> Here is the call sequence showing what happens after the last record of the first input:
> 1. Cascading invokes DeprecatedParquetInputFormat.createValue(), which returns the last record of the first input again
> 2. CombineFileRecordReader invokes RecordReader.next and reaches the EOF of the first input
> 3. CombineFileRecordReader creates a new DeprecatedParquetInputFormat.RecordReaderWrapper, which creates a new "value" variable containing the first record of the second input
> 4. CombineFileRecordReader invokes RecordReader.next on the new RecordReaderWrapper, but since the firstRecord flag is on, next does nothing
> 5. Thus the "value" variable containing the first record of the second input is lost, and Cascading reuses the last record of the first input
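
The call sequence above can be illustrated with a small, dependency-free simulation. This is only a sketch under stated assumptions: it does not use Hadoop, Parquet, or Cascading, and the names CombineReuseSketch, Container, WrapperSketch, readAll, and the copyOnFirst switch are hypothetical stand-ins introduced for illustration. The copyOnFirst=true path shows one possible way to avoid the problem (copying the eagerly loaded record into the caller's value); it is not necessarily the change that was committed for this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CombineReuseSketch {
    /** Mutable holder standing in for the reusable "value" object (hypothetical). */
    static final class Container {
        String record;
    }

    /** Simplified stand-in for a per-file reader wrapper that eagerly loads its first record. */
    static final class WrapperSketch {
        private final List<String> records;
        private final Container own = new Container();
        private final boolean copyOnFirst; // hypothetical switch, for illustration only
        private int pos = 0;
        private boolean firstRecord = false;
        private boolean eof = false;

        WrapperSketch(List<String> records, boolean copyOnFirst) {
            this.records = records;
            this.copyOnFirst = copyOnFirst;
            if (records.isEmpty()) {
                eof = true;
            } else {
                own.record = records.get(pos++); // first record is read in the constructor
                firstRecord = true;
            }
        }

        Container createValue() {
            return own;
        }

        boolean next(Container value) {
            if (eof) {
                return false;
            }
            if (firstRecord) {
                firstRecord = false;
                // Without the copy, this assumes 'value' is this wrapper's own container.
                if (copyOnFirst) {
                    value.record = own.record;
                }
                return true;
            }
            if (pos >= records.size()) {
                eof = true;
                return false;
            }
            value.record = records.get(pos++);
            return true;
        }
    }

    /** Mimics a combining reader: on EOF of one file it switches to a wrapper for the next. */
    static List<String> readAll(List<List<String>> files, boolean copyOnFirst) {
        List<String> seen = new ArrayList<>();
        int fileIdx = 0;
        WrapperSketch cur = new WrapperSketch(files.get(fileIdx++), copyOnFirst);
        while (true) {
            // Step 1: the caller obtains the value from the current (possibly exhausted) wrapper.
            Container value = cur.createValue();
            // Steps 2-4: on EOF a new wrapper is created, but 'value' still belongs to the old one.
            while (!cur.next(value)) {
                if (fileIdx >= files.size()) {
                    return seen;
                }
                cur = new WrapperSketch(files.get(fileIdx++), copyOnFirst);
            }
            seen.add(value.record);
        }
    }

    public static void main(String[] args) {
        List<List<String>> files = Arrays.asList(Arrays.asList("A"), Arrays.asList("B"));
        System.out.println(readAll(files, false)); // prints [A, A]
        System.out.println(readAll(files, true));  // prints [A, B]
    }
}
```

With two single-record inputs, the run without the copy reports the first record twice and never reports the second, which matches the symptom described in the report.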



--
This message was sent by Atlassian Jira
(v8.3.4#803005)