Posted to issues-all@impala.apache.org by "Joe McDonnell (JIRA)" <ji...@apache.org> on 2019/03/14 22:07:00 UTC

[jira] [Commented] (IMPALA-8109) Impala cannot read the gzip files bigger than 2 GB

    [ https://issues.apache.org/jira/browse/IMPALA-8109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16793105#comment-16793105 ] 

Joe McDonnell commented on IMPALA-8109:
---------------------------------------

I have a theory about this:

In Impala 2.10, we modified the file handle cache to improve performance for Parquet (IMPALA-4623). When a file handle comes from the cache, the code cannot assume the handle is positioned at the right location, so it must do an extra hdfsSeek() call in DiskIoMgr::ScanRange::Read(). Computing the absolute location in the file involves bytes_read_, and the result is incorrect once bytes_read_ overflows. The code prior to this change might not be impacted by an overflow. The file handle cache was enabled by default in Impala 2.12, which would explain why CDH 5.15 shows this issue, since it is based on Impala 2.12.

Some other environments have seen this issue, and changing bytes_read_ to an int64_t solves the problem. IMPALA-7543, which [~tarmstrong] mentioned earlier, now uses an int64_t for bytes read, so this issue does not exist on master.
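
To illustrate the theory, here is a minimal sketch of how a 32-bit byte counter would produce the offset seen in the error message. It is not the Impala code: Advance32, Advance64 and SeekTarget are hypothetical helpers, and the assumption that the seek target is simply the scan range offset plus bytes_read_ is a simplification of "a calculation involving bytes_read_". The wraparound is simulated through uint32_t because overflowing a signed integer directly is undefined behavior in C++.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a 32-bit signed bytes_read_ counter.
// The addition is done in uint32_t so the wraparound is well defined;
// the conversion back to int32_t is modular (guaranteed since C++20,
// and the behavior of mainstream compilers before that).
int32_t Advance32(int32_t bytes_read, uint32_t chunk) {
  uint32_t wrapped = static_cast<uint32_t>(bytes_read) + chunk;
  return static_cast<int32_t>(wrapped);
}

// The same counter widened to 64 bits, as in the fix described above.
int64_t Advance64(int64_t bytes_read, uint32_t chunk) {
  return bytes_read + static_cast<int64_t>(chunk);
}

// Assumed form of the absolute position passed to the seek call:
// start of the scan range plus the bytes consumed so far.
int64_t SeekTarget(int64_t range_offset, int64_t bytes_read) {
  return range_offset + bytes_read;
}
```

With a scan range starting at offset 0, reading one byte past 2 GB - 1 makes SeekTarget(0, Advance32(INT32_MAX, 1)) evaluate to -2147483648, the offset reported in "Error seeking to -2147483648", while the 64-bit counter yields 2147483648.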

If my theory is correct, a workaround for your existing environment would be to turn off the file handle cache by setting max_cached_file_handles=0.

I think we can resolve this issue.

> Impala cannot read the gzip files bigger than 2 GB
> --------------------------------------------------
>
>                 Key: IMPALA-8109
>                 URL: https://issues.apache.org/jira/browse/IMPALA-8109
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>    Affects Versions: Impala 2.12.0
>            Reporter: hakki
>            Priority: Major
>
> When querying a partition containing gzip files, the query fails with the error below: 
> WARNINGS: Disk I/O error: Error seeking to -2147483648 in file: hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz: 
> Error(255): Unknown error 255
> Root cause: EOFException: Cannot seek to negative offset
> The file hdfs://HADOOP_CLUSTER/user/hive/AAA/BBB/datehour=20180910/XXXXXXX.gz is a delimited text file larger than 2 GB (approx. 2.4 GB). The uncompressed size is ~13 GB.
> The impalad version is : 2.12.0-cdh5.15.0


