Posted to issues-all@impala.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2021/02/04 18:27:00 UTC

[jira] [Commented] (IMPALA-9224) Blacklist nodes with faulty disks

    [ https://issues.apache.org/jira/browse/IMPALA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279036#comment-17279036 ] 

ASF subversion and git services commented on IMPALA-9224:
---------------------------------------------------------

Commit b5e2a0ce2ed34dc12a47da23ec2adf65a2f60c0a in impala's branch refs/heads/master from wzhou-code
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b5e2a0c ]

IMPALA-9224: Blacklist nodes with faulty disk for spilling

This patch extends the blacklist functionality by adding an executor node
to the blacklist if a query fails because of a disk failure during
spill-to-disk. It also classifies disk error codes and defines a
blacklistable error set for non-transient disk errors. The coordinator
blacklists an executor only if the executor hit a blacklistable error
during spill-to-disk.
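
As a rough illustration of the classification described above, the
following Python sketch separates non-transient disk errors from errors
that usually indicate misconfiguration. It is not the patch's code (the
Impala backend is C++), and the set contents, the names
BLACKLISTABLE_ERRNOS and should_blacklist, and the use of raw errno
values are all assumptions made for this example.

```python
import errno

# Hypothetical set of errno values treated as non-transient disk failures.
# The real blacklistable error set is defined by the patch, not here.
BLACKLISTABLE_ERRNOS = frozenset({
    errno.EIO,    # low-level I/O error reported by the device
    errno.EROFS,  # file system was remounted read-only
})


def should_blacklist(err_no, during_spill):
    """Return True only for a blacklistable error hit during spill-to-disk.

    Errors such as EACCES (permission denied) or EDQUOT (quota exceeded)
    are not in the set, so they never lead to blacklisting.
    """
    return during_spill and err_no in BLACKLISTABLE_ERRNOS
```

Here should_blacklist(errno.EIO, during_spill=True) is true, while a
permission or quota error, or any error outside spill-to-disk, is not.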

Adds a new debug action to simulate a disk write error during
spill-to-disk. To use it, specify the following in the query options:
  'debug_action': 'IMPALA_TMP_FILE_WRITE:<hostname>:<port>:<action>'

  where <hostname> and <port> identify the impalad that executes the
  fragment instances; <port> is the BE krpc port (default 27000).
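
A minimal Python sketch of building that query option for a given
executor follows. The helper name tmp_file_write_debug_action is
hypothetical, and the default action value "FAIL" is assumed here for
illustration only; use whichever debug action the test needs.

```python
def tmp_file_write_debug_action(hostname, port=27000, action="FAIL"):
    """Build the 'debug_action' query option targeting one impalad.

    hostname/port identify the executor; port is its BE krpc port.
    The default action "FAIL" is an assumption made for this sketch.
    """
    if ":" in hostname:
        # The option value is colon-delimited, so a colon in the
        # hostname would make the fields ambiguous.
        raise ValueError("hostname must not contain ':'")
    value = "IMPALA_TMP_FILE_WRITE:{0}:{1}:{2}".format(hostname, port, action)
    return {"debug_action": value}


# Example: target the impalad on host "executor-1" at the default port.
query_options = tmp_file_write_debug_action("executor-1")
```

The resulting dictionary can be merged into the other query options used
when submitting a query that is expected to spill.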

Adds new test cases for blacklist and query-retry to cover the code
changes.

Testing:
 - Passed new test cases.
 - Passed exhaustive test.
 - Manually simulated disk failures in scratch directories on nodes
   of a cluster, verified that the nodes were blacklisted as
   expected.

Change-Id: I04bfcb7f2e0b1ef24a5b4350f270feecd8c47437
Reviewed-on: http://gerrit.cloudera.org:8080/16949
Reviewed-by: Impala Public Jenkins <im...@cloudera.com>
Tested-by: Impala Public Jenkins <im...@cloudera.com>


> Blacklist nodes with faulty disks
> ---------------------------------
>
>                 Key: IMPALA-9224
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9224
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Backend
>            Reporter: Sahil Takiar
>            Assignee: Wenzhe Zhou
>            Priority: Critical
>
> Similar to IMPALA-8339 and IMPALA-9137, Impala should blacklist nodes with faulty disks. Specifically, if a query fails because of a disk error, the node with that disk should be blacklisted and the query should be retried.
> We shouldn't need to blacklist nodes that fail to read from HDFS / S3, since they contain their own internal mechanisms for recovering from faulty disks. We should only blacklist nodes when failing to read / write from *local* disks.
> The two main components of Impala that read / write from local disk are the spill-to-disk and data caching features. Whenever a query fails because of a disk failure during spill-to-disk, the node should be blacklisted.
> Reads / writes from / to the data cache are a bit different. If a cache read fails due to a disk error, the error will be printed out and the Lookup() call to the cache will return 0 bytes read, which means it couldn't find the data in the cache. This should cause the scan to fall back to a normal, un-cached read. While this doesn't affect query correctness or the ability of a query to complete, it can affect performance. Since cache failures don't result in query failures, we might consider having a threshold of data cache read / write errors before blacklisting a node.
> We need to be careful to only capture specific disk failures - e.g. disk quota, permission denied, etc. errors shouldn't result in blacklisting as they typically are a result of system misconfiguration.



