Posted to issues@orc.apache.org by "William Hyun (Jira)" <ji...@apache.org> on 2022/09/03 22:44:01 UTC
[jira] [Closed] (ORC-1136) Optimize reads by combining multiple reads without significant separation into a single read

     [ https://issues.apache.org/jira/browse/ORC-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

William Hyun closed ORC-1136.
-----------------------------

> Optimize reads by combining multiple reads without significant separation into a single read
> --------------------------------------------------------------------------------------------
>
>                 Key: ORC-1136
>                 URL: https://issues.apache.org/jira/browse/ORC-1136
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java
>    Affects Versions: 1.7.3
>            Reporter: Pavan Lanka
>            Assignee: Pavan Lanka
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: benchmark.png, orc_read.png, seekvsread.png
>
>
> h2. Background
> We are moving our workloads from HDFS to AWS S3. As part of this activity we wanted to understand the performance
> characteristics and costs of using S3.
> h3. Seek vs Read
> One particular scenario that stood out in our performance testing was Seek vs Read when dealing with S3.
> In this test we read through a file as follows:
>  * Seek to point A in the file and read X bytes
>  * Move to point B in the file, where B = A + X + Y
>  ** The move is performed either as another seek or as a read through the Y-byte gap
>  ** Y is varied to determine which approach is faster at each gap size
>  * Read X bytes
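
The access pattern in the steps above can be sketched against a local file. This is only an illustration of the two ways of getting from A to B, not the S3 test harness used for the measurements; the class and method names (`SeekVsRead`, `withSeek`, `withRead`) are hypothetical, and on S3 each seek would typically translate into a new GET rather than a cheap local reposition.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;

/** Illustrative only: local-file version of the seek-vs-read access pattern. */
class SeekVsRead {

    /** Read X bytes at A, seek over the Y-byte gap, then read X bytes at B = A + X + Y. */
    static byte[] withSeek(RandomAccessFile f, long a, int x, int y) throws IOException {
        byte[] out = new byte[2 * x];
        f.seek(a);
        f.readFully(out, 0, x);
        f.seek(a + x + y);          // second positioning call: skips the gap
        f.readFully(out, x, x);
        return out;
    }

    /** Same result, but the gap is read in one pass and then discarded. */
    static byte[] withRead(RandomAccessFile f, long a, int x, int y) throws IOException {
        byte[] span = new byte[2 * x + y];
        f.seek(a);
        f.readFully(span);          // single contiguous read covering both blocks and the gap
        byte[] out = new byte[2 * x];
        System.arraycopy(span, 0, out, 0, x);
        System.arraycopy(span, x + y, out, x, x);
        return out;
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("seekvsread", ".bin");
        tmp.deleteOnExit();
        try (RandomAccessFile f = new RandomAccessFile(tmp, "rw")) {
            byte[] data = new byte[4096];
            for (int i = 0; i < data.length; i++) {
                data[i] = (byte) i;
            }
            f.write(data);
            // Both strategies must return identical bytes; only the I/O pattern differs.
            System.out.println(Arrays.equals(withSeek(f, 100, 64, 500), withRead(f, 100, 64, 500)));
        }
    }
}
```

Both methods return the same 2X bytes; the difference is one contiguous read of 2X + Y bytes versus two positioned reads of X bytes each, which is the trade-off the test varies Y to explore.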
> !seekvsread.png!
> Observations:
>  * A read is clearly more performant than a seek when dealing with steps/gaps of up to 4 MB.
>  ** At 4 MB read is faster by ~ 11%
>  ** At 1 MB read is faster by ~ 20%
>  * Reads are also cheaper, as we perform a single GET instead of multiple GETs; see [AWS S3 Pricing|https://aws.amazon.com/s3/pricing/]
>  ** Cost for GET: $0.0004 per 1,000 requests
>  ** Cost for Data Retrieval to the same region AWS EKS: $0.0000
> h3. ORC Read
> Based on the above performance penalty when dealing with multiple seeks over small gaps, we measured the performance of
> ORC read on a file.
> File details:
>  * Size: ~ 21 MB
>  * Column Count: ~ 400
>  * Row Count: ~ 65K
> !orc_read.png!
> Observations:
>  * We pay a significant penalty when reading alternate columns, which in the current implementation of ORC translates to multiple GET calls on AWS S3
>  * While the penalty is less significant for large reads, it still adds overhead in both time and cost
> h2. Read Optimization
> The following optimizations are proposed:
>  * *orc.min.disk.seek.size* is a value in bytes: When planning reads, if the gap between two reads
> is smaller than this value, they are combined into a single read.
>  * *orc.min.disk.seek.size.tolerance* is a fractional input: If the extra bytes read are greater than this fraction of
> the required bytes, then we drop the extra bytes from memory.
>  * We can further consider adding an optimization for the complete stripe in case the stripe size is smaller than
> `orc.min.disk.seek.size`
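
A minimal sketch of how the two settings described above could interact, assuming the required byte ranges are already sorted by offset and do not overlap. This is not the ORC implementation: the names `ReadCoalescer`, `Range`, `CombinedRead`, `coalesce` and `shouldDropExtraBytes` are hypothetical, and the strict "gap smaller than" comparison simply follows the wording of the proposal.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: not the ORC implementation. */
class ReadCoalescer {

    /** A required byte range [start, end) within the file. */
    static final class Range {
        final long start;
        final long end;

        Range(long start, long end) {
            this.start = start;
            this.end = end;
        }

        long length() {
            return end - start;
        }
    }

    /** One physical read that covers one or more required ranges. */
    static final class CombinedRead {
        final long start;
        long end;
        long requiredBytes;
        final List<Range> parts = new ArrayList<>();

        CombinedRead(Range first) {
            start = first.start;
            end = first.end;
            add(first);
        }

        void add(Range r) {
            parts.add(r);
            requiredBytes += r.length();
            end = Math.max(end, r.end);
        }

        /** Bytes fetched only because the gaps were read through. */
        long extraBytes() {
            return (end - start) - requiredBytes;
        }

        /** Plays the role of orc.min.disk.seek.size.tolerance in the proposal. */
        boolean shouldDropExtraBytes(double tolerance) {
            return extraBytes() > tolerance * requiredBytes;
        }
    }

    /** Combine neighbouring ranges whose gap is smaller than minSeekSize (in bytes). */
    static List<CombinedRead> coalesce(List<Range> sortedRanges, long minSeekSize) {
        List<CombinedRead> reads = new ArrayList<>();
        CombinedRead current = null;
        for (Range r : sortedRanges) {
            if (current != null && r.start - current.end < minSeekSize) {
                current.add(r);                 // small gap: read through it
            } else {
                current = new CombinedRead(r);  // large gap: start a new physical read
                reads.add(current);
            }
        }
        return reads;
    }

    public static void main(String[] args) {
        List<Range> ranges = Arrays.asList(
            new Range(0, 100), new Range(150, 250), new Range(10_000, 10_100));
        for (CombinedRead read : coalesce(ranges, 1024)) {
            System.out.println(read.start + ".." + read.end
                + " extra=" + read.extraBytes()
                + " dropAtZeroTolerance=" + read.shouldDropExtraBytes(0.0));
        }
    }
}
```

With a 1024-byte threshold, the first two ranges (gap of 50 bytes) collapse into one read of 0..250 with 50 extra bytes, while the third range stays separate. At a tolerance of 0.0 those 50 bytes would be dropped after the read; at 10.0 they would be retained, since 50 is well below 10 times the 200 required bytes.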
> h3. Scope
> Different types of IO take place in ORC today.
>  * Reading of File Footer: Unchanged
>  * Reading of Stripe Footer: Unchanged
>  * Reading of Stripe Index information: Optimized
>  * Reading of Stripe Data: Optimized
> Each of the above happens at different stages of the read. The current implementation optimizes reads that happen using the `DataReader` interface.
> This does not:
>  * Optimize the read of the file/stripe footer
>  * Optimize reads across multiple stripes
> h2. Benchmarks
> In this benchmark we brought up an EKS Container in the same region as the AWS S3 bucket to test the performance of the patch.
> !benchmark.png!
> Observations/Details:
>  * {*}Input File details{*}:
>  ** Rows: 65536
>  ** Columns: 128
>  ** FileSize: ~ 72 MB
>  * Full Read (alternate = false)
>  ** No significant difference between the options, as expected
>  * Alternate Read (alternate = true)
>  ** We get a significant boost in performance: 5.8s without the optimization versus 1.5s with it, a time
> reduction of ~ 75%
>  ** This also gives us a cost saving, as 64 GETs, one for each column per stripe, have been replaced with a single GET
>  ** We see a marginal improvement of ~ 3% when choosing to retain the extra bytes (extraByteTolerance=10.0) compared to
> (extraByteTolerance=0.0), which performs the additional work of dropping the extra bytes from memory.
> h2. Summary
> Based on the benchmarks the following is recommended for ORC in AWS S3:
>  * `orc.min.disk.seek.size` is set to `4194304` (4 MB)
>  * `orc.min.disk.seek.size.tolerance` is set to a value that is acceptable given the memory usage constraints. When set
> to `0.0`, it always does the extra work of dropping the extra bytes.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)