Posted to issues-all@impala.apache.org by "Csaba Ringhofer (Jira)" <ji...@apache.org> on 2021/11/19 14:56:00 UTC

[jira] [Commented] (IMPALA-11014) Data is being inserted even though an INSERT INTO query fails

    [ https://issues.apache.org/jira/browse/IMPALA-11014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17446526#comment-17446526 ] 

Csaba Ringhofer commented on IMPALA-11014:
------------------------------------------

Inserts into HDFS tables are not atomic in Impala by default - the only way to make inserts really atomic is to use Hive ACID or Iceberg tables, but support for these was added only in newer versions of Impala.

We try to make them "as atomic as possible" by writing to a staging directory and then moving the files to their final place with atomic moves - but each move is atomic only for its own file, so if several files are created and an error occurs when only a subset of them has been moved, we get a partial write. Another possible issue is that with dynamic partitioning the creation of new partitions can fail, so the moved files become visible in existing partitions but not in new ones.
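
To illustrate why per-file atomic moves do not add up to an atomic insert, here is a minimal sketch in Python. It is not Impala's actual code: the directory names, file names, and the fail_after parameter are hypothetical and exist only to simulate an error partway through the moves on a local filesystem.

```python
import os
import tempfile


def move_staged_files(staging_dir, final_dir, fail_after=None):
    """Move every staged file with its own rename and return the names moved.

    fail_after is a hypothetical knob: it simulates an error once that many
    files have already been moved.
    """
    moved = []
    for name in sorted(os.listdir(staging_dir)):
        if fail_after is not None and len(moved) >= fail_after:
            raise OSError("simulated failure after %d moves" % len(moved))
        # Each rename is atomic for this one file, not for the whole batch.
        os.rename(os.path.join(staging_dir, name),
                  os.path.join(final_dir, name))
        moved.append(name)
    return moved


def demo_partial_write():
    """Stage three files, fail after two moves, and report what is where."""
    with tempfile.TemporaryDirectory() as root:
        staging = os.path.join(root, "_staging")  # hypothetical layout
        final = os.path.join(root, "table")
        os.mkdir(staging)
        os.mkdir(final)
        for i in range(3):
            with open(os.path.join(staging, "data.%d.txt" % i), "w") as f:
                f.write("row\n")
        try:
            move_staged_files(staging, final, fail_after=2)
        except OSError:
            pass  # the "query" fails, but two files are already visible
        return sorted(os.listdir(final)), sorted(os.listdir(staging))
```

Calling demo_partial_write() leaves two files in the final directory and one in staging, which is the partial-write situation described above: the statement failed, yet part of its output is already in place.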

There are some cases when we don't even use staging directories, for example on S3 when the query option s3_skip_insert_staging is true (the goal is to skip the move operation, as it is expensive on S3).

"MetaException: Object with id "" is managed by a different persistence manager "
This error is not familiar to me, but I expect it to come from HMS. What version of HMS is used?

>Can you suggest a workaround for this? Is it safe to assume that the data is always inserted when this particular error happens?
I am not sure - sending some parts of the Impala log could help to see where this error comes from. If it comes after the file moves and partition creation, then the write can be considered "complete".

> Can we rely on the rows_inserted and rows_produced fields of the query in order to make assumptions about what data is inserted?
No, these are populated when we write the files, and do not tell anything about whether moving the files was successful.

>Can you suggest a workaround for this?
It is possible to check whether the moves for an INSERT were finished by checking the staging directory in the filesystem - the names of the files are prefixed by the query ID (e.g. 194f9d029d30bb07-fb64dc3300000000_945554289_data.0.txt is created by query 194f9d029d30bb07:fb64dc3300000000) - if you see such files in the staging directory, then not all moves were finished. It is also possible to clean up the files created by a given INSERT this way.
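
The check could be scripted roughly as follows. This is only a sketch: the function names are hypothetical, the file-name pattern is inferred from the single example above, and obtaining the list of file names from the staging directory (or from the table directory, for cleanup) is left to whatever filesystem tool you use.

```python
def staged_file_prefix(query_id):
    """Turn a query ID like 'hi:lo' into the file-name prefix 'hi-lo'.

    Assumption: the prefix is the query ID with ':' replaced by '-',
    as in the one example file name quoted above.
    """
    hi, sep, lo = query_id.partition(":")
    if not sep or not hi or not lo:
        raise ValueError("expected a query ID of the form hi:lo, got %r" % query_id)
    return hi + "-" + lo


def files_from_query(file_names, query_id):
    """Return the file names that appear to have been written by query_id."""
    prefix = staged_file_prefix(query_id)
    return [name for name in file_names if name.startswith(prefix)]


# Example with a hypothetical directory listing:
listing = [
    "194f9d029d30bb07-fb64dc3300000000_945554289_data.0.txt",
    "aaaaaaaaaaaaaaaa-bbbbbbbbbbbbbbbb_123_data.0.txt",
]
leftover = files_from_query(listing, "194f9d029d30bb07:fb64dc3300000000")
```

If the result is non-empty for a listing of the staging directory, then not all moves of that INSERT were finished. Running the same filter over a listing of the table directory gives candidates for cleanup, which should be reviewed before deleting anything.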

It would be great to have some counters in the profile for the intended number of moves and the actually finished ones, but I don't know of anything like this.

> Data is being inserted even though an INSERT INTO query fails
> -------------------------------------------------------------
>
>                 Key: IMPALA-11014
>                 URL: https://issues.apache.org/jira/browse/IMPALA-11014
>             Project: IMPALA
>          Issue Type: Bug
>            Reporter: Tsvetomir Palashki
>            Priority: Major
>
> We are executing an INSERT INTO query against Impala. In rare cases this query fails with the following error:
> {code:java}
> MetaException: Object with id "" is managed by a different persistence manager {code}
> Even though there is an error, the data is inserted into the table. This is particularly problematic due to our error handling logic, which refreshes the table metadata and retries the query, which causes data duplication.
> I am aware that this bug might be fixed in one of the newer Impala versions, but at this point, we are unable to upgrade.
> Can you suggest a workaround for this? Is it safe to assume that the data is always inserted when this particular error happens? Can we rely on the rows_inserted and rows_produced fields of the query in order to make assumptions about what data is inserted?
> The exact version of our Impala is:
> {code:java}
> impalad version 3.2.0-cdh6.3.2 RELEASE (build 1bb9836227301b839a32c6bc230e35439d5984ac) Built on Fri Nov 8 07:22:06 PST 2019 {code}
>  


