Posted to issues@hawq.apache.org by "Lei Chang (JIRA)" <ji...@apache.org> on 2016/07/14 01:02:20 UTC
[jira] [Updated] (HAWQ-923) More data skipping optimization for IO
intensive performance enhancement
[ https://issues.apache.org/jira/browse/HAWQ-923?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lei Chang updated HAWQ-923:
---------------------------
Summary: More data skipping optimization for IO intensive performance enhancement (was: More data skipping technology for IO intensive performance enhancement)
> More data skipping optimization for IO intensive performance enhancement
> -------------------------------------------------------------------------
>
> Key: HAWQ-923
> URL: https://issues.apache.org/jira/browse/HAWQ-923
> Project: Apache HAWQ
> Issue Type: Wish
> Components: Query Execution
> Reporter: Ming LI
> Assignee: Lei Chang
> Fix For: backlog
>
>
> see email discussion here: http://mail-archives.apache.org/mod_mbox/hawq-dev/201607.mbox/%3CCA+F1uf=TjCiOezkvpHSPpAOG-jg0-0AzqTUsgr7RV+EsV44kFQ@mail.gmail.com%3E
> Data skipping can avoid a large amount of unnecessary IO, so it can
> significantly improve performance for IO-intensive queries. Besides
> eliminating scans of unnecessary table partitions according to the
> partition key range, I think more options are available now:
> (1) The Parquet / ORC formats introduce lightweight metadata such as
> Min/Max/Bloom filter for each block. Such metadata can be exploited when
> predicate/filter info can be fetched before executing the scan.
> However, in HAWQ today all data in a parquet table needs to be scanned
> into memory before the predicate/filter is processed. We don't generate
> the metadata when INSERTing into a parquet table, and the scan executor
> doesn't use the metadata either. Maybe some scan APIs need to be
> refactored so that we can get predicate/filter info before executing the
> base relation scan.
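As a rough illustration of the idea in (1), the sketch below keeps a min/max
pair per block and skips any block whose range cannot satisfy a range
predicate. It is not HAWQ or Parquet code: the names `Block`, `make_block`
and `scan_range` are hypothetical, and it assumes the metadata is computed
when the block is written.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical data block with min/max metadata computed at write time."""
    values: List[int]
    min_value: int
    max_value: int


def make_block(values) -> Block:
    # Assumption: the writer (e.g. INSERT) records min/max for each block.
    values = list(values)
    return Block(values=values, min_value=min(values), max_value=max(values))


def scan_range(blocks: List[Block], low: int, high: int) -> Tuple[List[int], int]:
    """Return (rows with low <= v <= high, number of blocks actually read)."""
    rows: List[int] = []
    blocks_read = 0
    for block in blocks:
        # The predicate is known before the scan, so a block whose
        # [min, max] range does not overlap [low, high] is never read.
        if block.max_value < low or block.min_value > high:
            continue
        blocks_read += 1
        rows.extend(v for v in block.values if low <= v <= high)
    return rows, blocks_read


blocks = [make_block(range(0, 100)),
          make_block(range(100, 200)),
          make_block(range(200, 300))]
rows, blocks_read = scan_range(blocks, 120, 130)
```

Here only the middle block overlaps the predicate range, so the other two
are skipped without being read.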
> (2) Based on the technology in (1), especially with Bloom filters, more
> optimizer techniques can be explored further. E.g. Impala implemented
> runtime filtering
> (https://www.cloudera.com/documentation/enterprise/latest/topics/impala_runtime_filtering.html),
> which can be used for
> - dynamic partition pruning
> - converting join predicate to base relation predicate
> It tells the executor to wait for a moment (the interval can be set in a
> GUC) before executing the base relation scan. If the values of interest
> (e.g. when the column in the join predicate only has a very small set of
> values) arrive in time, the scan can use these values as a filter; if they
> don't arrive in time, it scans without this filter, which doesn't impact
> result correctness.
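The wait-then-scan behaviour described in (2) can be sketched as follows.
This is a minimal illustration, not Impala's or HAWQ's implementation: the
function names and the `wait_seconds` parameter are hypothetical, and an
exact Python set stands in for a Bloom filter. The scan waits a bounded time
for the join-key values; if they arrive, it filters rows early, otherwise it
returns everything and leaves the filtering to the join.

```python
import queue


def scan_with_runtime_filter(rows, key, filter_queue, wait_seconds):
    """Scan rows, applying a runtime filter only if it arrives in time."""
    try:
        # Wait up to wait_seconds (a GUC-like setting, assumed here)
        # for the set of join-key values from the other side of the join.
        allowed = filter_queue.get(timeout=wait_seconds)
    except queue.Empty:
        # Filter did not arrive in time: fall back to an unfiltered scan.
        return list(rows)
    return [row for row in rows if row[key] in allowed]


def join(build_rows, probe_rows, key):
    """Simple semi-join: keep probe rows whose key appears on the build side."""
    build_keys = {row[key] for row in build_rows}
    return [row for row in probe_rows if row[key] in build_keys]


fact = [{"id": i, "dim_id": i % 10} for i in range(100)]
dim = [{"dim_id": 3}, {"dim_id": 7}]

# Case 1: the filter is already available when the scan starts.
arrived = queue.Queue()
arrived.put({row["dim_id"] for row in dim})
filtered_scan = scan_with_runtime_filter(fact, "dim_id", arrived, wait_seconds=0.1)

# Case 2: the filter never arrives, so the scan proceeds without it.
late = queue.Queue()
full_scan = scan_with_runtime_filter(fact, "dim_id", late, wait_seconds=0.01)
```

In both cases the join produces the same rows; the filter only reduces how
many rows the scan hands to the join.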
> Unlike the technology in (1), this technique does not help in every case;
> it only outperforms in some cases. So it just adds some more query plan
> choices/paths, and the optimizer needs to calculate the cost based on
> statistics and apply it only when it lowers the cost.
> All in all, maybe more similar technologies can be adopted for HAWQ now.
> Let's start to think about performance-related technologies; moreover, we
> need to investigate how these technologies can be implemented in HAWQ.
> Any ideas or suggestions are welcome. Thanks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)