You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@orc.apache.org by "Panagiotis Garefalakis (Jira)" <ji...@apache.org> on 2021/03/12 18:41:00 UTC
[jira] [Commented] (ORC-744) LazyIO of non-filter columns
[ https://issues.apache.org/jira/browse/ORC-744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300546#comment-17300546 ]
Panagiotis Garefalakis commented on ORC-744:
--------------------------------------------
Hey [~planka] I can see a few open PRs here -- what the next one to look at? ORC-759?
> LazyIO of non-filter columns
> ----------------------------
>
> Key: ORC-744
> URL: https://issues.apache.org/jira/browse/ORC-744
> Project: ORC
> Issue Type: Improvement
> Components: Reader
> Reporter: Pavan Lanka
> Assignee: Pavan Lanka
> Priority: Major
> Attachments: image-2021-01-25-14-34-45-375.png
>
>
> h2. Background
> This feature request started as a result of a large search that is performed with the following characteristics:
> * The search fields are not part of partition, bucket or sort fields.
> * The table is a very large table.
> * The predicates result in very few rows compared to the scan size.
> * The search columns are a significant subset of selection columns in the query.
> Initial analysis showed that we could have a significant benefit by lazily reading the non-search columns only when we have a match. We explore the design and some benchmarks in subsequent sections.
> h2. Design
> This builds further on ORC-577 which currently only restricts deserialization for some selected data types but does not improve on IO.
> On a high level the design includes the following components:
> !image-2021-01-25-14-34-45-375.png!
> * *SArg to Filter*: Converts Search Arguments passed down into filters for efficient application during scans.
> * *Read*: Performs the lazy read using the filters.
> ** *Read Filter Columns*: Read the filter columns from the file.
> ** *Apply Filter*: Apply the filter on the read filter columns.
> ** *Read Select Columns*: If filter selects at least a row then read the remaining columns.
>
> This issue has the following tasks that provides further details on the design of the respective components:
> # ORC-741: Bug fix related to schema evolution of missing columns in the presence of filters
> # ORC-742: LazyIO of non-filter columns
> # ORC-743: Conversion of SArg to Filter
>
> h2. Tests
> We evaluated this approach against a search job with the following stats:
> * Table
> ** Size: ~*420 TB*
> ** Data fields: ~*120*
> ** Partition fields: *3*
> * Scan
> ** Search fields: 3 data fields with large (~ 1000 value) IN clauses compounded by *OR*.
> ** Select fields: 16 data fields (includes the 3 search fields), 1 partition field
> ** Search:
> *** Size: ~*180 TB*
> *** Records: *3.99 T*
> ** Selected:
> *** Size: ~*100 MB*
> *** Records: *1 M*
> We have observed the following reductions compared with the absence of the patch:
> ||Test||IO Reduction %||CPU Reduction %||
> |Select 16 columns|45|47|
> |SELECT *|70|87|
> * The savings are more significant as you increase the number of select columns with respect to the search columns
> * When the filter selects most data, no significant penalty observed as a result of 2 IO compared with a single IO
> ** We do have a penalty as a result of the filter application on the selected records.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)