You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues-all@impala.apache.org by "Michael Ho (JIRA)" <ji...@apache.org> on 2019/03/25 05:55:00 UTC

[jira] [Created] (IMPALA-8341) Data cache for remote reads

Michael Ho created IMPALA-8341:
----------------------------------

             Summary: Data cache for remote reads
                 Key: IMPALA-8341
                 URL: https://issues.apache.org/jira/browse/IMPALA-8341
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 3.2.0
            Reporter: Michael Ho
            Assignee: Michael Ho


When running in public cloud (e.g. AWS with S3) or in certain private cloud settings (e.g. data stored in object store), the computation and storage are no longer co-located. This breaks the typical pattern in which Impala query fragment instances are scheduled at where the data is located. In this setting, the network bandwidth requirement of both the nics and the top of rack switches will go up quite a lot as the network traffic includes the data fetch in addition to the shuffling exchange traffic of intermediate results.

To mitigate the pressure on the network, one can build a storage backed cache at the compute nodes to cache the working set. With deterministic scan range scheduling, each compute node should hold non-overlapping partitions of the data set. 

A prototype of the cache was posted here: https://gerrit.cloudera.org/#/c/12683/



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-all-unsubscribe@impala.apache.org
For additional commands, e-mail: issues-all-help@impala.apache.org