Posted to issues@impala.apache.org by "Luis E Martinez-Poblete (JIRA)" <ji...@apache.org> on 2017/09/15 18:14:00 UTC

[jira] [Created] (IMPALA-5943) Add parameter to disable "alter table change" command by default

Luis E Martinez-Poblete created IMPALA-5943:
-----------------------------------------------

             Summary: Add parameter to disable "alter table change" command by default
                 Key: IMPALA-5943
                 URL: https://issues.apache.org/jira/browse/IMPALA-5943
             Project: IMPALA
          Issue Type: Improvement
          Components: Frontend
    Affects Versions: Impala 2.8.0
            Reporter: Luis E Martinez-Poblete


It is possible to corrupt Parquet files by executing the "alter table change" command. Please consider the following scenario:

create table mytest (c1 int) stored as parquet;

insert into table mytest values (1);

ALTER TABLE mytest CHANGE c1 c1 bigint COMMENT "change data type";

insert into mytest values(999999999999);

select * from mytest;
--query fails with error 
-- File 'hdfs://xxxxx:8020/user/hive/warehouse/mytest/1b418f638de8221c-f1d456f100000000_1174562035_data.0.parq' has an incompatible Parquet schema for column 'default.mytest.c1'. Column type: BIGINT, Parquet schema: optional int32 c1 [i:0 d:1 r:0]

ALTER TABLE mytest CHANGE c1 c1 int COMMENT "change data type";

select * from mytest;
--query fails with error : Bad status for request TFetchResultsReq(fetchType=0, operationHandle=TOperationHandle(hasResultSet=True, modifiedRowCount=None, operationType=0, operationId=THandleIdentifier(secret='\xb3\x1a\xda9\xaaJH\xfe\x00\x00\x00\x00Pe\xd3t', guid='\xb3\x1a\xda9\xaaJH\xfe\x00\x00\x00\x00Pe\xd3t')), orientation=4, maxRows=100): TFetchResultsResp(status=TStatus(errorCode=None, errorMessage="File 'hdfs://xxxx:8020/user/hive/warehouse/mytest/f440e026e86f2d54-abfc55600000000_1916518758_data.0.parq' has an incompatible Parquet schema for column 'default.mytest.c1'. Column type: INT, Parquet schema:\noptional int64 c1 [i:0 d:1 r:0]\n", sqlState='HY000', infoMessages=None, statusCode=3), results=None, hasMoreRows=None)

The documentation is very clear about this behavior, but it does not mention that it can potentially corrupt Parquet files:

================
"You cannot change a TINYINT, SMALLINT, or INT column to BIGINT, or the other way around. Although the ALTER TABLE succeeds, any attempt to query those columns results in conversion errors."

"Changing the type of a column works if existing data values can be safely converted to the new type. The type conversion rules depend on the file format of the underlying table. For example, in a text table, the same value can be interpreted as a STRING or a numeric value, while in a binary format such as Parquet, the rules are stricter and type conversions only work between certain sizes of integers.

....

Remember that Impala does not actually do any conversion for the underlying data files as a result of ALTER TABLE statements. If you use ALTER TABLE to create a table layout that does not agree with the contents of the underlying files, you must replace the files yourself, such as using LOAD DATA to load a new set of data files, or INSERT OVERWRITE to copy from another table and replace the original data."
================

In practice, most people will not read the documentation and will probably run into this situation. Is it possible to add a self-protection mechanism? It would be nice to have a parameter that disables the "alter table change" command when the underlying table has Parquet files. The command could be disabled by default, and users could enable it after understanding the implications of the "alter table change" command.
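
To make the request concrete, here is a minimal sketch of the decision such a parameter might control. It is only an illustration, not Impala code: the names ColumnChange, is_change_allowed, and allow_parquet_type_change are hypothetical. It also assumes that only changes to the column type need guarding, so renames and comment-only changes stay allowed.

```python
from dataclasses import dataclass


# Hypothetical model of the requested guard; none of these names exist in Impala.
@dataclass(frozen=True)
class ColumnChange:
    file_format: str  # table file format, e.g. "PARQUET" or "TEXT"
    old_type: str     # current column type, e.g. "INT"
    new_type: str     # requested column type, e.g. "BIGINT"


def is_change_allowed(change: ColumnChange,
                      allow_parquet_type_change: bool = False) -> bool:
    """Return True if the ALTER TABLE ... CHANGE should be allowed to run."""
    # Assumption: a rename or comment-only change keeps the type, so it is safe.
    if change.old_type.upper() == change.new_type.upper():
        return True
    # Tables in other file formats are outside the scope of the requested guard.
    if change.file_format.upper() != "PARQUET":
        return True
    # Parquet type changes run only when the operator has explicitly opted in.
    return allow_parquet_type_change
```

With the default setting, the INT to BIGINT change from the scenario above would be rejected for the Parquet table, and it would only run after the hypothetical parameter is set to true.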


