Posted to issues@hive.apache.org by "Ayush Saxena (Jira)" <ji...@apache.org> on 2022/12/14 21:41:00 UTC

[jira] [Comment Edited] (HIVE-26699) Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX

    [ https://issues.apache.org/jira/browse/HIVE-26699?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17647708#comment-17647708 ] 

Ayush Saxena edited comment on HIVE-26699 at 12/14/22 9:40 PM:
---------------------------------------------------------------

{quote}is hive hadoop 3.3.x + only yet?
{quote}
Yes, we are at 3.3.1.

[~rajesh.balamohan] I couldn't find a standard way to set this value in the conf so that it is used for every Iceberg metadata read and only for those reads. I had doubts about setting it in a couple of places: if the conf is shared, the value could also get picked up somewhere else, where we don't intend it to be.

But I tried a draft approach using the openFile API. It is a bit hacky for Hive, but that's what I could think of as of now. The main change is here:
[https://github.com/apache/hive/pull/3862/files#diff-661ab0f0af817370c70a7320b3cf51d3b0ff690f6a74aa97765bb0c819a550bbR181-R184]
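The scoping concern behind this approach can be modelled without Hadoop on the classpath. The sketch below is not the code from the pull request and does not call the Hadoop API. A plain {{Map}} stands in for a shared conf, and the class name {{PerOpenOptionSketch}} and the helper {{effectiveOptions}} are hypothetical. It only illustrates the idea that an option passed at open time (as the {{openFile}} builder allows via per-open options) applies to that one read and leaves the shared settings untouched.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class PerOpenOptionSketch {
    // Real S3A key from the issue description; everything else here is illustrative.
    static final String FADVISE = "fs.s3a.experimental.input.fadvise";

    // Hypothetical helper: overlay per-open options on a COPY of the shared settings,
    // so the shared map is never mutated.
    static Map<String, String> effectiveOptions(Map<String, String> shared,
                                                Map<String, String> perOpen) {
        Map<String, String> merged = new HashMap<>(shared);
        merged.putAll(perOpen);
        return merged;
    }

    public static void main(String[] args) {
        // Stand-in for a conf shared by every reader in the process.
        Map<String, String> shared = new HashMap<>();
        shared.put(FADVISE, "random");

        // Metadata (JSON) read: override the policy for this one open only.
        Map<String, String> metadataRead =
            effectiveOptions(shared, Collections.singletonMap(FADVISE, "normal"));

        // Any other read: no per-open override, so the shared value applies.
        Map<String, String> dataRead =
            effectiveOptions(shared, Collections.<String, String>emptyMap());

        System.out.println("metadata read: " + metadataRead.get(FADVISE));
        System.out.println("data read:     " + dataRead.get(FADVISE));
        System.out.println("shared conf:   " + shared.get(FADVISE));
    }
}
```

In this model the metadata read sees "normal" while the data read and the shared settings keep "random", which is the isolation that setting the value directly in a shared conf would not give.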

I ditched the instanceof check: I don't think this config can do any harm on other filesystems, checking the FS type and setting the option should cost about the same, and an instanceof check might not be correct with ViewFs or similar. But I can add it if you feel it is needed.
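The instanceof concern can be shown with a small stand-alone sketch. None of the types below are Hadoop classes: {{Fs}}, {{ObjectStoreFs}} and {{ViewLikeFs}} are hypothetical stand-ins for a filesystem interface, an S3-style implementation, and a view/mount-table filesystem that delegates to a target. The point is only that a type check on the outer object says nothing about the filesystem a path finally resolves to.

```java
public class InstanceofWrapperSketch {
    // Hypothetical minimal filesystem interface.
    interface Fs {
        String scheme();
    }

    // Stand-in for an object-store filesystem such as S3A.
    static class ObjectStoreFs implements Fs {
        public String scheme() { return "s3a"; }
    }

    // Stand-in for a view-style filesystem that forwards to a target filesystem.
    static class ViewLikeFs implements Fs {
        private final Fs target;
        ViewLikeFs(Fs target) { this.target = target; }
        public String scheme() { return "viewfs"; }
        Fs resolve() { return target; }
    }

    // The kind of guard an instanceof check would implement.
    static boolean looksLikeObjectStore(Fs fs) {
        return fs instanceof ObjectStoreFs;
    }

    public static void main(String[] args) {
        Fs direct = new ObjectStoreFs();
        ViewLikeFs viaView = new ViewLikeFs(new ObjectStoreFs());

        // Direct use: the guard matches.
        System.out.println("direct:   " + looksLikeObjectStore(direct));
        // Through the wrapper: the guard misses, although reads end up on the object store.
        System.out.println("via view: " + looksLikeObjectStore(viaView));
        System.out.println("resolved: " + looksLikeObjectStore(viaView.resolve()));
    }
}
```

Under these assumptions the guard is true for the direct and resolved cases and false for the wrapper, so a guarded option would silently be skipped for paths reached through the view.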

Just FYI, core Iceberg seems to be on the hadoop-2 line only, if I got it right:
[https://github.com/apache/iceberg/blob/master/versions.props#L4]

Let me know if this approach can work for now; otherwise I will discuss with folks and see if we can find some other route.


was (Author: ayushtkn):
{quote}is hive hadoop 3.3.x + only yet?
{quote}
yes, we are at 3.3.1.

[~rajesh.balamohan] I couldn't find a standard way to set this value in the conf for Iceberg metadata reads only, and I had doubts about setting it in a couple of places: if the conf is shared, the value could also get used somewhere else, where we don't intend it to be.

But I tried a draft approach using the openFile API. It is a bit hacky for Hive, but that's what I could think of as of now.
[https://github.com/apache/hive/pull/3862/files#diff-661ab0f0af817370c70a7320b3cf51d3b0ff690f6a74aa97765bb0c819a550bbR181-R184]

I ditched the instanceof check: I don't think this config can do any harm on other filesystems, checking the FS type and setting the option should cost about the same, and an instanceof check might not be correct with ViewFs or similar.

Just FYI, core Iceberg seems to be on the hadoop-2 line only, if I got it right:
[https://github.com/apache/iceberg/blob/master/versions.props#L4]

Let me know if this approach can work for now or I will try to discuss with folks and see if we can find some other route.

> Iceberg: S3 fadvise can hurt JSON parsing significantly in DWX
> --------------------------------------------------------------
>
>                 Key: HIVE-26699
>                 URL: https://issues.apache.org/jira/browse/HIVE-26699
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Hive reads JSON metadata information (TableMetadataParser::read()) multiple times, e.g. during query compilation, AM split computation, stats computation, and commits.
>  
> With large JSON files (due to multiple inserts), this takes much longer on the S3 FS when "fs.s3a.experimental.input.fadvise" is set to "random" (e.g. on the order of 10x). To be on the safer side, it would be good to set this to "normal" mode in configs when reading Iceberg tables.


