Posted to issues@orc.apache.org by "Pavan Lanka (Jira)" <ji...@apache.org> on 2022/03/26 00:01:00 UTC

[jira] [Updated] (ORC-1136) Optimize reads by combining multiple reads without significant separation into a single read

     [ https://issues.apache.org/jira/browse/ORC-1136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pavan Lanka updated ORC-1136:
-----------------------------
     Attachment: benchmark.png
                 orc_read.png
    Description: 
h2. Background

We are moving our workloads from HDFS to AWS S3. As part of this activity we wanted to understand the performance
characteristics and costs of using S3.
h3. Seek vs Read

One particular scenario that stood out in our performance testing was Seek vs Read when dealing with S3.

In this test we read through a file as follows:
 * Seek to point A in the file and read X bytes
 * Move to point B in the file, where B = A + X + Y
 * The move is accomplished either as another seek or as a read through the gap
 * Y is left variable to determine when each approach is best
 * Read X bytes

!orc_read.png!
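The two access patterns can be sketched in plain Java against a local file (a hedged illustration, not the actual benchmark code; the method names are made up, and a local file only shows the mechanics — on S3 the seek variant typically maps to a separate ranged GET, which is where the cost difference comes from):

```java
import java.io.IOException;
import java.io.RandomAccessFile;

public class SeekVsRead {
  // Strategy 1: read X bytes at A, then seek over the gap of Y bytes and
  // issue a new positioned read. On S3 the second read is a second GET.
  static byte[] readWithSeek(RandomAccessFile f, long a, int x, int y) throws IOException {
    byte[] first = new byte[x], second = new byte[x];
    f.seek(a);
    f.readFully(first);
    f.seek(a + x + y);      // jump over the gap
    f.readFully(second);
    return second;
  }

  // Strategy 2: read straight through the gap and discard the Y bytes in
  // between, so everything is served by a single sequential read (one GET).
  static byte[] readThrough(RandomAccessFile f, long a, int x, int y) throws IOException {
    byte[] first = new byte[x], gap = new byte[y], second = new byte[x];
    f.seek(a);
    f.readFully(first);
    f.readFully(gap);       // extra bytes we do not need, dropped afterwards
    f.readFully(second);
    return second;
  }
}
```

Both strategies return the same X bytes starting at A + X + Y; the question the benchmark answers is at which Y the single sequential read stops being cheaper.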

Observations:
 * We could clearly see that a read is more performant than a seek when dealing with steps/gaps smaller than 4 MB.
 ** At a 4 MB gap, read is faster by ~11%
 ** At a 1 MB gap, read is faster by ~20%
 * Reads are also cheaper, as we perform a single GET instead of multiple GETs (per [AWS S3 Pricing|https://aws.amazon.com/s3/pricing/])
 ** Cost per 1,000 GET requests: $0.0004
 ** Cost for data retrieval to AWS EKS in the same region: $0.0000

h3. ORC Read

Given the above penalty for multiple seeks over small gaps, we measured the performance of an ORC read on a file.

File details:
 * Size: ~ 21 MB
 * Column Count: ~ 400
 * Row Count: ~ 65K

!benchmark.png!

Observations:
 * We can clearly see that we pay a significant penalty when reading alternate columns, which in the current implementation of ORC translates to multiple GET calls on AWS S3
 * While the penalty is less significant for large reads, it still incurs overhead in both time and cost

h2. Read Optimization

The following optimizations are proposed:
 * *orc.min.disk.seek.size* is a value in bytes: when determining a single read, if the gap between two reads is smaller than this value, they are combined into a single read.
 * *orc.min.disk.seek.size.tolerance* is a fraction: if the extra bytes read exceed this fraction of the required bytes, the extra bytes are dropped from memory.
 * We could further consider an optimization that reads the complete stripe in one call when the stripe size is smaller than `orc.min.disk.seek.size`.
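The gap-based coalescing can be sketched as follows (a minimal illustration with hypothetical names, not the patch's actual API; the real implementation works on ORC's internal buffer chunks):

```java
import java.util.ArrayList;
import java.util.List;

public class RangeMerger {
  // A contiguous byte range [offset, offset + length).
  public record Range(long offset, long length) {
    long end() { return offset + length; }
  }

  // Combine sorted, non-overlapping ranges whenever the gap between two
  // neighbours is smaller than minSeekSize, accepting that the bytes in the
  // gap are read (and later dropped) as the price of issuing fewer GETs.
  public static List<Range> merge(List<Range> sorted, long minSeekSize) {
    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      if (!merged.isEmpty()) {
        Range last = merged.get(merged.size() - 1);
        long gap = r.offset() - last.end();
        if (gap < minSeekSize) {
          // Replace the last range with one covering both reads plus the gap.
          merged.set(merged.size() - 1,
              new Range(last.offset(), r.end() - last.offset()));
          continue;
        }
      }
      merged.add(r);
    }
    return merged;
  }
}
```

With the recommended 4 MB threshold, two 10-byte reads separated by a few bytes collapse into one read, while a read several megabytes away stays separate.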

h3. Scope

Different types of IO take place in ORC today:
 * Reading of File Footer: Unchanged
 * Reading of Stripe Footer: Unchanged
 * Reading of Stripe Index information: Optimized
 * Reading of Stripe Data: Optimized

Each of the above happens at a different stage of the read. The current implementation optimizes reads that go through the `DataReader` interface.

This does not:
 * Optimize reads of the file/stripe footers
 * Combine reads across multiple stripes

h2. Benchmarks

In this benchmark we brought up an EKS container in the same region as the AWS S3 bucket to test the performance of the patch.

!benchmark.png!

Observations/Details:
 * *Input File details*:
 ** Rows: 65536
 ** Columns: 128
 ** FileSize: ~ 72 MB
 * Full Read (alternate = false)
 ** No significant difference between the options, as expected
 * Alternate Read (alternate = true)
 ** We get a significant performance boost, from 5.8 s without the optimization to 1.5 s with it, a time reduction of ~75%
 ** This also gives a cost saving, as 64 GETs (one for each read column per stripe) are replaced with a single GET
 ** We see a marginal improvement of ~3% when retaining the extra bytes (extraByteTolerance=10.0) compared to (extraByteTolerance=0.0), which does the additional work of dropping the extra bytes from memory

h2. Summary

Based on the benchmarks, the following is recommended for ORC on AWS S3:
 * Set `orc.min.disk.seek.size` to `4194304` (4 MB)
 * Set `orc.min.disk.seek.size.tolerance` to a value acceptable under your memory usage constraints. When set to `0.0`, it will always do the extra work of dropping the extra bytes.
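The tolerance semantics described above can be illustrated with a one-line check (the method name is hypothetical, not the patch's API):

```java
public class SeekTolerance {
  // Mirrors the described orc.min.disk.seek.size.tolerance behaviour:
  // keep the over-read gap bytes only while extra <= tolerance * required;
  // otherwise do the additional work of dropping them from memory.
  static boolean keepExtraBytes(long requiredBytes, long extraBytes, double tolerance) {
    return extraBytes <= tolerance * requiredBytes;
  }
}
```

With tolerance `0.0` the check fails for any non-zero over-read, so the extra bytes are always dropped; a generous tolerance trades memory for skipping that work.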



> Optimize reads by combining multiple reads without significant separation into a single read
> --------------------------------------------------------------------------------------------
>
>                 Key: ORC-1136
>                 URL: https://issues.apache.org/jira/browse/ORC-1136
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java
>    Affects Versions: 1.7.3
>            Reporter: Pavan Lanka
>            Assignee: Pavan Lanka
>            Priority: Major
>             Fix For: 1.8.0
>
>         Attachments: benchmark.png, orc_read.png, seekvsread.png
>
>



--
This message was sent by Atlassian Jira
(v8.20.1#820001)