Posted to issues@orc.apache.org by "Pavan Lanka (Jira)" <ji...@apache.org> on 2021/01/25 22:34:00 UTC
[jira] [Updated] (ORC-744) LazyIO of non-filter columns

     [ https://issues.apache.org/jira/browse/ORC-744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavan Lanka updated ORC-744:
----------------------------
    Description: 
h2. Background

This feature request originated from a large search with the following characteristics:
 * The search fields are not part of the partition, bucket or sort fields.
 * The table is very large.
 * The predicates select very few rows relative to the scan size.
 * The search columns are a significant subset of the selection columns in the query.

Initial analysis showed that we could gain a significant benefit by reading the non-search columns lazily, that is, only when the search columns produce a match. We explore the design and some benchmarks in subsequent sections.
h2. Design

This builds on ORC-577, which currently restricts deserialization only for some selected data types and does not reduce IO.

At a high level, the design includes the following components:

 

{{┌──────────────┐          ┌────────────────────────┐
│              │          │          Read          │
│              │          │                        │
│              │          │     ┌────────────┐     │
│SArg to Filter│─────────▶│     │Read Filter │     │
│              │          │     │  Columns   │     │
│              │          │     └────────────┘     │
│              │          │            │           │
└──────────────┘          │            ▼           │
                          │     ┌────────────┐     │
                          │     │Apply Filter│     │
                          │     └────────────┘     │
                          │            │           │
                          │            ▼           │
                          │     ┌────────────┐     │
                          │     │Read Select │     │
                          │     │  Columns   │     │
                          │     └────────────┘     │
                          │                        │
                          │                        │
                          └────────────────────────┘}}
 * *SArg to Filter*: Converts Search Arguments passed down into filters for efficient application during scans.
 * *Read*: Performs the lazy read using the filters.
 ** *Read Filter Columns*: Read the filter columns from the file.
 ** *Apply Filter*: Apply the filter on the read filter columns.
 ** *Read Select Columns*: If the filter selects at least one row, read the remaining columns.
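
The components above can be illustrated with a minimal, language-agnostic sketch in Python. It is not the ORC reader API: {{CountingReader}}, {{sarg_to_filter}} and {{lazy_read}} are hypothetical names, the "stripe" is a plain in-memory dictionary, and only a single-column predicate is handled. It is meant only to show that the remaining columns are touched solely when the filter selects at least one row.

```python
class CountingReader:
    """Hypothetical stand-in for file IO; records which columns were read."""

    def __init__(self, stripe):
        self._stripe = stripe      # column name -> list of values
        self.columns_read = []     # order in which columns were fetched

    def read(self, column):
        self.columns_read.append(column)
        return self._stripe[column]


def sarg_to_filter(column, op, literal):
    """SArg to Filter: turn a simple (column, op, literal) search argument
    into a per-value predicate. Only three operators are sketched here."""
    ops = {
        "=": lambda v: v == literal,
        "<": lambda v: v < literal,
        ">": lambda v: v > literal,
    }
    return column, ops[op]


def lazy_read(reader, select_columns, filter_column, row_matches):
    # Read Filter Columns: fetch only the column the predicate needs.
    filter_values = reader.read(filter_column)

    # Apply Filter: collect the indexes of the matching rows.
    selected = [i for i, v in enumerate(filter_values) if row_matches(v)]
    if not selected:
        # No match: the remaining columns are never read.
        return {name: [] for name in select_columns}

    # Read Select Columns: fetch the rest and keep only the selected rows.
    result = {}
    for name in select_columns:
        values = filter_values if name == filter_column else reader.read(name)
        result[name] = [values[i] for i in selected]
    return result


# Assumed sample data, for illustration only.
stripe = {
    "id": [1, 2, 3, 4],
    "city": ["a", "b", "c", "d"],
    "payload": ["w", "x", "y", "z"],
}

column, row_matches = sarg_to_filter("id", "=", 3)
hit_reader = CountingReader(stripe)
hit = lazy_read(hit_reader, ["id", "payload"], column, row_matches)

_, no_match = sarg_to_filter("id", ">", 100)
miss_reader = CountingReader(stripe)
miss = lazy_read(miss_reader, ["id", "payload"], "id", no_match)
```

In this sketch the matching query reads {{id}} and then {{payload}}, while the non-matching query reads only {{id}}. The actual implementation works on batches of rows and on the file's stream layout, which this sketch does not model.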

 

This issue has the following tasks, which provide further details on the design of the respective components:
 # ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
 # ORC-742: LazyIO of non-filter columns
 # ORC-743: Conversion of SArg to Filter

  was:This is the umbrella issue that shall track all of the related issues.


> LazyIO of non-filter columns
> ----------------------------
>
>                 Key: ORC-744
>                 URL: https://issues.apache.org/jira/browse/ORC-744
>             Project: ORC
>          Issue Type: Improvement
>          Components: Reader
>            Reporter: Pavan Lanka
>            Assignee: Pavan Lanka
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)