Posted to issues-all@impala.apache.org by "Tim Armstrong (Jira)" <ji...@apache.org> on 2019/12/10 01:33:00 UTC
[jira] [Commented] (IMPALA-9224) Blacklist nodes with faulty disks

    [ https://issues.apache.org/jira/browse/IMPALA-9224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992092#comment-16992092 ] 

Tim Armstrong commented on IMPALA-9224:
---------------------------------------

We looked at doing this locally ages ago - IMPALA-4683 was filed back then and is probably subsumed by this.  There were also some efforts a while back, which I think [~gaborkaszab] made, to identify all the errors that might be returned: https://github.com/apache/impala/blob/257fa0c68bb4e64880a64844d8d4023c54645230/be/src/runtime/io/error-converter.cc#L30.

> Blacklist nodes with faulty disks
> ---------------------------------
>
>                 Key: IMPALA-9224
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9224
>             Project: IMPALA
>          Issue Type: Improvement
>          Components: Backend
>            Reporter: Sahil Takiar
>            Priority: Critical
>
> Similar to IMPALA-8339 and IMPALA-9137, Impala should blacklist nodes with faulty disks. Specifically, if a query fails because of a disk error, the node with that disk should be blacklisted and the query should be retried.
> We shouldn't need to blacklist nodes that fail to read from HDFS / S3, since those storage systems have their own internal mechanisms for recovering from faulty disks. We should only blacklist nodes when they fail to read from / write to *local* disks.
> The two main components of Impala that read / write from local disk are the spill-to-disk and data caching features. Whenever a query fails because of a disk failure during spill-to-disk, the node should be blacklisted.
> Reads / writes from / to the data cache are a bit different. If a cache read fails due to a disk error, the error will be printed out and the Lookup() call to the cache will return 0 bytes read, which means it couldn't find the data in the cache. This should cause the scan to fall back to a normal, un-cached read. While this doesn't affect query correctness or the ability of a query to complete, it can affect performance. Since cache failures don't result in query failures, we might consider having a threshold of data cache read / write errors before blacklisting a node.
> We need to be careful to only capture specific disk failures. Errors such as disk quota exceeded or permission denied shouldn't result in blacklisting, as they are typically a result of system misconfiguration.
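
The policy described in the quoted issue could be sketched roughly as follows. This is not Impala code: the class name, method names, the default threshold of 3, and the choice of EIO as the only errno treated as a disk fault are all assumptions made for illustration. The issue text itself only says that quota and permission errors should not trigger blacklisting, that spill-to-disk failures should, and that data cache errors might be counted against a threshold.

```python
import errno

# Assumption: only EIO is treated as evidence of a faulty disk here.
# The real set of errors would need to come from the backend's error handling.
DISK_FAULT_ERRNOS = {errno.EIO}

# From the issue text: quota and permission errors usually indicate
# misconfiguration, so they must never lead to blacklisting.
MISCONFIG_ERRNOS = {errno.EDQUOT, errno.EACCES, errno.EPERM}


def is_disk_fault(err: OSError) -> bool:
    """Return True only for errors we attribute to failing hardware."""
    if err.errno in MISCONFIG_ERRNOS:
        return False
    return err.errno in DISK_FAULT_ERRNOS


class NodeDiskHealth:
    """Hypothetical per-node tracker for local disk errors."""

    def __init__(self, cache_error_threshold: int = 3):
        # Hypothetical tunable: cache disk errors tolerated before blacklisting.
        self.cache_error_threshold = cache_error_threshold
        self.cache_errors = 0
        self.blacklisted = False

    def on_spill_error(self, err: OSError) -> bool:
        """A spill-to-disk failure fails the query, so blacklist immediately."""
        if is_disk_fault(err):
            self.blacklisted = True
        return self.blacklisted

    def on_cache_error(self, err: OSError) -> bool:
        """Cache errors only cost performance, so count them against a threshold."""
        if is_disk_fault(err):
            self.cache_errors += 1
            if self.cache_errors >= self.cache_error_threshold:
                self.blacklisted = True
        return self.blacklisted
```

With this shape, a single spill failure caused by EIO marks the node for blacklisting, while data cache failures only do so once the hypothetical threshold is reached, and permission or quota errors are ignored in both paths.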


