Posted to common-issues@hadoop.apache.org by "Steve Loughran (Jira)" <ji...@apache.org> on 2024/03/06 11:11:00 UTC

[jira] [Updated] (HADOOP-19101) Vectored Read into off-heap buffer broken in fallback implementation

     [ https://issues.apache.org/jira/browse/HADOOP-19101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Steve Loughran updated HADOOP-19101:
------------------------------------
    Description: 
{{VectoredReadUtils.readInDirectBuffer()}} always starts reading at position zero, even when the range begins at a different offset. As a result, the buffer can be filled with the wrong data.

The fix for this is straightforward: we pass in a FileRange and use its offset as the starting position.
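
To illustrate the difference, here is a minimal self-contained sketch of a chunked copy into a direct buffer. It is not the Hadoop source: {{PositionedRead}}, {{inMemory()}}, {{demo()}} and the tiny {{TMP_BUFFER_MAX_SIZE}} are hypothetical names chosen for this example. Passing 0 as the start position models the broken behaviour; passing the range's offset models the fix.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class DirectBufferReadSketch {

  /** Hypothetical positioned read: fill dest[offset..offset+length) from a file position. */
  interface PositionedRead {
    void readFully(long position, byte[] dest, int offset, int length) throws IOException;
  }

  /** Deliberately tiny so that the copy needs several chunks. */
  static final int TMP_BUFFER_MAX_SIZE = 4;

  /**
   * Copy `length` bytes into `buffer` through an on-heap temporary array,
   * starting at file position `start`. The reported bug is equivalent to
   * always passing 0 here instead of the range's offset.
   */
  static void readInDirectBuffer(long start, int length, ByteBuffer buffer,
      PositionedRead op) throws IOException {
    if (length == 0) {
      return;
    }
    byte[] tmp = new byte[Math.min(TMP_BUFFER_MAX_SIZE, length)];
    long position = start;
    int readBytes = 0;
    while (readBytes < length) {
      int chunk = Math.min(tmp.length, length - readBytes);
      op.readFully(position, tmp, 0, chunk);
      buffer.put(tmp, 0, chunk);
      position += chunk;
      readBytes += chunk;
    }
  }

  /** Hypothetical in-memory "file" used only for this demonstration. */
  static PositionedRead inMemory(byte[] file) {
    return (position, dest, offset, length) -> {
      if (position < 0 || position + length > file.length) {
        throw new EOFException("read past end of data");
      }
      System.arraycopy(file, (int) position, dest, offset, length);
    };
  }

  /** Read 6 bytes from a 16-byte file into a direct buffer, starting at `start`. */
  static String demo(long start) {
    byte[] file = "0123456789ABCDEF".getBytes(StandardCharsets.US_ASCII);
    ByteBuffer direct = ByteBuffer.allocateDirect(6);
    try {
      readInDirectBuffer(start, 6, direct, inMemory(file));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    direct.flip();
    byte[] out = new byte[direct.remaining()];
    direct.get(out);
    return new String(out, StandardCharsets.US_ASCII);
  }
}
```

For a range at offset 8 with length 6, {{demo(8)}} returns the bytes the caller asked for ("89ABCD"), while {{demo(0)}} returns the first six bytes of the file ("012345"). No exception is raised in the second case, which is what makes the failure hard to notice.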

However, this does mean that all shipping releases from 3.3.5 to 3.4.0 cannot safely use vectored IO to read into direct buffers through HDFS, ABFS or GCS. Note that we have never seen this in production, because the Parquet and ORC libraries both read into on-heap storage.

Those libraries need to be audited to make sure that they never attempt to read into off-heap direct buffers. This is a bit trickier than you would think, because an allocator is passed in. For PARQUET-2171 we will:
* only invoke the API on streams which explicitly declare their support for the API (so fallback in parquet itself)
* not invoke when direct buffer allocation is in use.
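
The two conditions above could be combined into a single guard along these lines. This is only a sketch of the idea, not the PARQUET-2171 code: {{CapabilityProbe}}, {{Allocator}}, {{streamWith()}} and {{useVectoredRead()}} are hypothetical stand-ins, and the capability string "in:readvectored" is an assumption made for illustration.

```java
import java.util.Set;

final class VectoredReadGuardSketch {

  /** Capability name assumed, for this sketch, to mark vectored IO support. */
  static final String VECTORED_IO = "in:readvectored";

  /** Hypothetical stand-in for a stream that can be probed for capabilities. */
  interface CapabilityProbe {
    boolean hasCapability(String capability);
  }

  /** Hypothetical stand-in for an allocator that reports whether it is off-heap. */
  interface Allocator {
    boolean isDirect();
  }

  /**
   * Use the vectored read API only when the stream explicitly declares
   * support for it and the allocator hands out on-heap buffers.
   */
  static boolean useVectoredRead(CapabilityProbe stream, Allocator allocator) {
    return stream.hasCapability(VECTORED_IO) && !allocator.isDirect();
  }

  /** Build a probe that declares exactly the given capabilities. */
  static CapabilityProbe streamWith(String... capabilities) {
    Set<String> declared = Set.of(capabilities);
    return declared::contains;
  }
}
```

With this guard, a stream that does not declare the capability, or any read that would use a direct allocator, falls back to the caller's own non-vectored read path.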

  was:

{{VectoredReadUtils.readInDirectBuffer()}} always starts reading at position zero, even when the range begins at a different offset. As a result, the buffer can be filled with the wrong data.

The fix for this is straightforward: we pass in a FileRange and use its offset as the starting position.

However, this does mean that all shipping releases from 3.3.5 to 3.4.0 cannot safely use vectored IO to read into direct buffers through HDFS, ABFS or Azure. Note that we have never seen this in production, because the Parquet and ORC libraries both read into on-heap storage.

Those libraries need to be audited to make sure that they never attempt to read into off-heap direct buffers. This is a bit trickier than you would think, because an allocator is passed in. For PARQUET-2171 we will:
* only invoke the API on streams which explicitly declare their support for the API (so fallback in parquet itself)
* not invoke when direct buffer allocation is in use.


> Vectored Read into off-heap buffer broken in fallback implementation
> --------------------------------------------------------------------
>
>                 Key: HADOOP-19101
>                 URL: https://issues.apache.org/jira/browse/HADOOP-19101
>             Project: Hadoop Common
>          Issue Type: Sub-task
>          Components: fs, fs/azure
>    Affects Versions: 3.4.0, 3.3.6
>            Reporter: Steve Loughran
>            Assignee: Steve Loughran
>            Priority: Blocker


