Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2022/02/23 10:04:00 UTC

[jira] [Commented] (IMPALA-11134) Impala returns "Couldn't skip rows in file" error for old Parquet file

    [ https://issues.apache.org/jira/browse/IMPALA-11134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496629#comment-17496629 ] 

ASF subversion and git services commented on IMPALA-11134:
----------------------------------------------------------

Commit b60ccabd5b6f09842284c657b910bb65d2f30fe8 in impala's branch refs/heads/master from Zoltan Borok-Nagy
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b60ccab ]

IMPALA-11134: Impala returns "Couldn't skip rows in file" error for old Parquet file

Impala returns a "Couldn't skip rows in file" error for old Parquet
files written by an old Impala version (e.g. Impala 2.5, 2.6). In DEBUG
builds Impala crashes on a DCHECK:

 Check failed: num_buffered_values_ > 0 (-1 vs. 0)

The problem is that in some old Parquet files there can be a mismatch
between 'num_values' in a page and the encoded def/rep levels.
These files usually have one extra def/rep level encoded.

In SkipTopLevelRows() we skipped values based on how many def levels are left:
https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314

Since there are more def levels than values in some old files,
num_buffered_values_ could become negative.

This patch also takes the value of num_buffered_values_ into account
when calculating 'read_count', so we can deal with such files. With
this patch we also include the column name in the "Couldn't skip rows"
error message, so in the future it'll be easier to identify the
problematic columns.
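
The clamping described above can be sketched roughly as follows. This is
a simplified illustration, not the actual Impala code: PageState,
ReadCountOld() and ReadCountNew() are hypothetical names, and the "old"
variant only assumes that the previous calculation used the remaining
rows and the repeated run length. The "new" variant follows the
expression given in the issue description.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical stand-in for the column reader state; not Impala's class.
struct PageState {
  int64_t num_buffered_values;  // values left in the page per 'num_values'
  int64_t repeated_run_length;  // def levels left in the current run
};

// Assumed shape of the old calculation: bounded only by the rows still
// to skip and by the length of the current def level run.
int64_t ReadCountOld(const PageState& s, int64_t num_rows, int64_t i) {
  return std::min(num_rows - i, s.repeated_run_length);
}

// Patched calculation: additionally bounded by num_buffered_values, so
// subtracting the result cannot drive the counter below zero.
int64_t ReadCountNew(const PageState& s, int64_t num_rows, int64_t i) {
  return std::min(std::min(s.num_buffered_values, num_rows - i),
                  s.repeated_run_length);
}
```

With 3 buffered values and a run of 4 def levels, the old form would
skip 4 and leave the counter at -1, while the clamped form skips 3 and
leaves it at 0.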

Testing:
 * added Parquet file written by Impala 2.5 and e2e test for it

Change-Id: I568fe59df720ea040be4926812412ba4c1510a26
Reviewed-on: http://gerrit.cloudera.org:8080/18257
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Impala returns "Couldn't skip rows in file" error for old Parquet file
> ----------------------------------------------------------------------
>
>                 Key: IMPALA-11134
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11134
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Zoltán Borók-Nagy
>            Assignee: Zoltán Borók-Nagy
>            Priority: Major
>
> Impala returns a "Couldn't skip rows in file" error for old Parquet files written by an old Impala version (e.g. Impala 2.5, 2.6).
> In DEBUG builds Impala crashes on a DCHECK:
> {noformat}
> F0217 18:21:34.449540 24288 parquet-column-readers.cc:1611] d3407555528be8a8:5ea3fceb00000001] Check failed: num_buffered_values_ > 0 (-1 vs. 0)
> {noformat}
> The problem is that in some old Parquet files there can be a mismatch between 'num_values' in a page and the encoded def/rep levels. These files usually have one extra def/rep level encoded.
> In SkipTopLevelRows() we skip values based on how many def levels are left:
> https://github.com/apache/impala/blob/92ce6fe48e75d7780efe9a275122554e59aac916/be/src/exec/parquet/parquet-column-readers.cc#L1308-L1314
> Since there are more def levels than values, {{num_buffered_values_}} becomes {{-1}}. I looked at Parquet files written by newer Impala versions, and there the number of def levels matches the number of values.
> The workaround is fairly easy: we could also take the value of num_buffered_values_ into account when calculating 'read_count', i.e. min(min(num_buffered_values_, num_rows - i), repeated_run_length), so we can deal with such files.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
