Posted to issues@impala.apache.org by "Zoltán Borók-Nagy (Jira)" <ji...@apache.org> on 2022/02/23 11:43:00 UTC

[jira] [Resolved] (IMPALA-11134) Impala returns "Couldn't skip rows in file" error for old Parquet file

     [ https://issues.apache.org/jira/browse/IMPALA-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zoltán Borók-Nagy resolved IMPALA-11134.
----------------------------------------
    Fix Version/s: Impala 4.1.0
       Resolution: Fixed

> Impala returns "Couldn't skip rows in file" error for old Parquet file
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-11134
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11134
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>             Fix For: Impala 4.1.0
>
>
> Impala returns a "Couldn't skip rows in file" error for old Parquet files written by an old Impala version (e.g. Impala 2.5 or 2.6).
> In a DEBUG build, Impala crashes on a DCHECK:
> {noformat}
> F0217 18:21:34.449540 24288 parquet-column-readers.cc:1611] d3407555528be8a8:5ea3fceb00000001] Check failed: num_buffered_values_ > 0 (-1 vs. 0)
> {noformat}
> The problem is that in some old Parquet files there can be a mismatch between 'num_values' in a page and the number of encoded def/rep levels. These files usually have one more def/rep level encoded than 'num_values' indicates.
> In SkipTopLevelRows() we skip values based on how many def levels are left:
> https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314
> Since there are more def levels than values, {{num_buffered_values_}} becomes {{-1}}. I looked at Parquet files written by newer Impala versions, and there the number of def levels matches the number of values.
> The workaround is fairly easy: we could also take the value of num_buffered_values_ into account when calculating 'read_count', i.e. min(min(num_buffered_values_, num_rows - i), repeated_run_length), so that we can deal with such files.
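
The clamping described above can be illustrated with a minimal, standalone sketch. This is not the actual Impala code: PageState, ClampedReadCount() and SkipRows() are hypothetical names made up for this example, and the real SkipTopLevelRows() tracks more state and moves on to the next page instead of simply stopping. The sketch only shows how including num_buffered_values in the min() keeps the counter from dropping below zero when a page carries one extra def level.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the per-page state a column reader tracks.
struct PageState {
  int64_t num_buffered_values;  // values left in the page, per 'num_values'
  int32_t repeated_run_length;  // def levels left in the current repeated run
};

// read_count = min(min(num_buffered_values_, num_rows - i), repeated_run_length)
inline int64_t ClampedReadCount(const PageState& page, int64_t num_rows, int64_t i) {
  const int64_t remaining_rows = num_rows - i;
  return std::min<int64_t>(std::min(page.num_buffered_values, remaining_rows),
                           page.repeated_run_length);
}

// Skips up to 'num_rows' rows within one page and returns how many were
// actually skipped. Stops once the page has no buffered values left.
inline int64_t SkipRows(PageState* page, int64_t num_rows) {
  int64_t i = 0;
  while (i < num_rows) {
    const int64_t read_count = ClampedReadCount(*page, num_rows, i);
    if (read_count <= 0) break;  // page exhausted (or run is empty)
    page->num_buffered_values -= read_count;
    page->repeated_run_length -= read_count;
    i += read_count;
    // Mirrors the spirit of the DCHECK: the counter must never go negative.
    assert(page->num_buffered_values >= 0);
  }
  return i;
}
```

For a page with 3 values but a run of 4 def levels, the unclamped min(num_rows - i, repeated_run_length) would consume 4 and leave the counter at -1, while the clamped version consumes 3 and stops at 0.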



--
This message was sent by Atlassian Jira
(v8.20.1#820001)