Posted to issues@orc.apache.org by "Pavan Lanka (Jira)" <ji...@apache.org> on 2021/01/25 22:34:00 UTC
[jira] [Updated] (ORC-744) LazyIO of non-filter columns
[ https://issues.apache.org/jira/browse/ORC-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Pavan Lanka updated ORC-744:
----------------------------
Description:
h2. Background
This feature request arose from a large search with the following characteristics:
* The search fields are not part of the partition, bucket or sort fields.
* The table is very large.
* The predicates select very few rows compared to the scan size.
* The search columns are a significant subset of the selection columns in the query.
Initial analysis showed a significant benefit from lazily reading the non-search columns only when there is a match. We explore the design and some benchmarks in subsequent sections.
h2. Design
This builds on ORC-577, which currently restricts deserialization only for some selected data types but does not reduce IO.
At a high level, the design includes the following components:
{{┌──────────────┐          ┌────────────────────────┐
│              │          │          Read          │
│              │          │                        │
│              │          │     ┌────────────┐     │
│SArg to Filter│─────────▶│     │Read Filter │     │
│              │          │     │  Columns   │     │
│              │          │     └────────────┘     │
│              │          │            │           │
└──────────────┘          │            ▼           │
                          │     ┌────────────┐     │
                          │     │Apply Filter│     │
                          │     └────────────┘     │
                          │            │           │
                          │            ▼           │
                          │     ┌────────────┐     │
                          │     │Read Select │     │
                          │     │  Columns   │     │
                          │     └────────────┘     │
                          │                        │
                          │                        │
                          └────────────────────────┘}}
* *SArg to Filter*: Converts Search Arguments passed down into filters for efficient application during scans.
* *Read*: Performs the lazy read using the filters.
** *Read Filter Columns*: Read the filter columns from the file.
** *Apply Filter*: Apply the filter on the read filter columns.
** *Read Select Columns*: If the filter selects at least one row, read the remaining columns.
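The components above can be sketched as a small, self-contained Python model. This is an illustration of the control flow only, not the ORC reader API: {{CountingReader}}, {{sarg_to_filter}} and {{lazy_read}} are hypothetical names, a row group is modelled as an in-memory dict of column lists, and a search argument is reduced to a single (column, operator, literal) triple.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for one row group: column name -> list of values.
RowGroup = Dict[str, List[object]]
RowFilter = Callable[[Dict[str, object]], bool]


def sarg_to_filter(column: str, op: str, literal: object) -> RowFilter:
    """SArg to Filter: turn one (column, op, literal) predicate into a row filter.

    Assumption: only three comparison operators are modelled here.
    """
    ops = {
        "=": lambda v: v == literal,
        "<": lambda v: v is not None and v < literal,
        ">": lambda v: v is not None and v > literal,
    }
    if op not in ops:
        raise ValueError(f"unsupported operator: {op}")
    compare = ops[op]
    return lambda row: compare(row[column])


class CountingReader:
    """Toy reader that records which columns were actually fetched."""

    def __init__(self, data: RowGroup) -> None:
        self._data = data
        self.columns_read: List[str] = []

    def read(self, column: str) -> List[object]:
        self.columns_read.append(column)
        return self._data[column]


def lazy_read(
    reader: CountingReader,
    filter_columns: List[str],
    select_columns: List[str],
    row_filter: RowFilter,
) -> List[Dict[str, object]]:
    """Read filter columns, apply the filter, then read the rest only on a match.

    Assumption: filter_columns contains at least one column.
    """
    # Read Filter Columns: fetch only the columns the predicate needs.
    filter_data = {c: reader.read(c) for c in filter_columns}
    row_count = len(filter_data[filter_columns[0]])

    # Apply Filter: collect the indexes of the rows that pass.
    selected = [
        i
        for i in range(row_count)
        if row_filter({c: filter_data[c][i] for c in filter_columns})
    ]

    # Read Select Columns: skip the remaining IO when nothing matched.
    if not selected:
        return []
    remaining = [c for c in select_columns if c not in filter_data]
    columns = {**filter_data, **{c: reader.read(c) for c in remaining}}
    return [{c: columns[c][i] for c in select_columns} for i in selected]
```

Calling {{lazy_read}} with a filter built by {{sarg_to_filter("id", "=", 2)}} fetches the non-filter columns only when some row has {{id}} equal to 2; {{columns_read}} on the reader shows which columns were touched. In this toy model the saving is per row group, which is an assumption about granularity rather than a statement about the actual implementation.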
This issue has the following tasks, which provide further details on the design of the respective components:
# ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
# ORC-742: LazyIO of non-filter columns
# ORC-743: Conversion of SArg to Filter
was:This is the umbrella issue that shall track all of the related issues.
> LazyIO of non-filter columns
> ----------------------------
>
> Key: ORC-744
> URL: https://issues.apache.org/jira/browse/ORC-744
> Project: ORC
> Issue Type: Improvement
> Components: Reader
> Reporter: Pavan Lanka
> Assignee: Pavan Lanka
> Priority: Major
--
This message was sent by Atlassian Jira
(v8.3.4#803005)