You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@hive.apache.org by Anup Tiwari <an...@gmail.com> on 2022/02/18 10:42:33 UTC

Too many S3 API calls for simple queries like select and create external table

Hi Team,

We are using Hive heavily for our ETL where our data is stored on S3 and so
we have seen a strange behaviour between Hive & S3 interaction in terms of
S3 API calls i.e. Actual number of API calls for simple select statements
were too much compared to expected so let us know why it is behaving like
this because if we execute the same select statement via Athena then the
number of API calls are very less.



*Background :-*
We are incurring some S3 API cost and to understand each API call better,
we decided to do simple testing.

1. We have a non partition table containing a lot of objects in parquet
format on S3.

2. We copied one parquet file object(data) to a separate S3 bucket(target)
so now our target bucket contains one parquet file data in following
hierarchy on S3 :-
s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)

3. After that, we have executed following 3 command in Apache Hive 2.1.1
managed by us on EC2 cluster :-

(i) Create External table on top of above S3 location :-

CREATE EXTERNAL TABLE `anup.Test`(
  `id` int,
  `cname` varchar(45),
  `mef` decimal(10,3),
  `mlpr` int,
  `qperiod` int,
  `validity` int,
  `rpmult` decimal(10,3))
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://<BUCKET>/Test' ;

(ii) msck repair table Test(Just to test behaviour) ;
(iii) Simple select statement :- select * from anup.Test ;




*Results :-*
Ideally, we were *expecting max 5-10 API calls* with below breakdown


1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
create "Test/" object.
2. msck repair : 1-2 API calls ; since we have single object behind table
3. select *  : 1-2 API calls ; since we have single object behind table


But *Actual number of Total API calls was 37* and we have fetched this from
S3 Access Logs via Athena. Breakdown of these calls are as follows :-

1. Create External : 9 API calls
2. msck repair : 3 API calls
3. select *  : 25 API calls


Attaching actual results of S3 Access Logs for select command along with
DEBUG logs of Hive for select statement.

Let us know why so many API calls are happening for the Create External /
select statement because if we execute the same select statement *via
Athena* then the number of API calls are very less i.e. *2*.




*Tools / S3 library details :-*
Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
aws-java-sdk-core-1.10.6.jar

Regards,
Anup Tiwari

Re: Too many S3 API calls for simple queries like select and create external table

Posted by Rajesh Balamohan <rb...@apache.org>.
If you are using parquet format, HIVE-25827
<https://issues.apache.org/jira/browse/HIVE-25827>would be causing
additional calls to s3 as the footer is read atleast twice. Add to this
atleast 9+ list_status calls being made for split gen.

~Rajesh.B

On Mon, Feb 21, 2022 at 10:16 AM Sungwoo Park <gl...@gmail.com> wrote:

> My understanding is that additional calls to S3 APi is the price to pay
> for using the Hadoop library which only emulates FileSystem on top of S3.
> S3 is not a distributed file system like HDFS, so some of the API calls
> cannot be optimized in an ideal way.
>
> For (i), a more serious problem is the cost of traversing the entire
> directory, which is totally unnecessary. This was fixed in HIVE-24849:
>
> https://issues.apache.org/jira/browse/HIVE-24849
>
> You can find several JIRAs that try to reduce the overhead of calling S3
> at a higher level, e.g.:
>
> https://issues.apache.org/jira/browse/HIVE-25277
> https://issues.apache.org/jira/browse/HIVE-24546
>
> We can also remove some of S3 calls with a technique described in
> HIVE-24546. However, I think, unless some optimization is implemented at
> the level of Hadoop, the overhead cannot be completely eliminated.
>
> --- Sungwoo
>
> On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <an...@gmail.com>
> wrote:
>
>> Hi Team,
>>
>> We are using Hive heavily for our ETL where our data is stored on S3 and
>> so we have seen a strange behaviour between Hive & S3 interaction in terms
>> of S3 API calls i.e. Actual number of API calls for simple select
>> statements were too much compared to expected so let us know why it is
>> behaving like this because if we execute the same select statement via
>> Athena then the number of API calls are very less.
>>
>>
>>
>> *Background :-*
>> We are incurring some S3 API cost and to understand each API call better,
>> we decided to do simple testing.
>>
>> 1. We have a non partition table containing a lot of objects in parquet
>> format on S3.
>>
>> 2. We copied one parquet file object(data) to a separate S3
>> bucket(target) so now our target bucket contains one parquet file data in
>> following hierarchy on S3 :-
>> s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)
>>
>> 3. After that, we have executed following 3 command in Apache Hive 2.1.1
>> managed by us on EC2 cluster :-
>>
>> (i) Create External table on top of above S3 location :-
>>
>> CREATE EXTERNAL TABLE `anup.Test`(
>>   `id` int,
>>   `cname` varchar(45),
>>   `mef` decimal(10,3),
>>   `mlpr` int,
>>   `qperiod` int,
>>   `validity` int,
>>   `rpmult` decimal(10,3))
>> ROW FORMAT SERDE
>>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
>> STORED AS INPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
>> OUTPUTFORMAT
>>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
>> LOCATION
>>   's3a://<BUCKET>/Test' ;
>>
>> (ii) msck repair table Test(Just to test behaviour) ;
>> (iii) Simple select statement :- select * from anup.Test ;
>>
>>
>>
>>
>> *Results :-*
>> Ideally, we were *expecting max 5-10 API calls* with below breakdown
>>
>>
>> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
>> HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
>> create "Test/" object.
>> 2. msck repair : 1-2 API calls ; since we have single object behind table
>> 3. select *  : 1-2 API calls ; since we have single object behind table
>>
>>
>> But *Actual number of Total API calls was 37* and we have fetched this
>> from S3 Access Logs via Athena. Breakdown of these calls are as follows :-
>>
>> 1. Create External : 9 API calls
>> 2. msck repair : 3 API calls
>> 3. select *  : 25 API calls
>>
>>
>> Attaching actual results of S3 Access Logs for select command along with
>> DEBUG logs of Hive for select statement.
>>
>> Let us know why so many API calls are happening for the Create External /
>> select statement because if we execute the same select statement *via
>> Athena* then the number of API calls are very less i.e. *2*.
>>
>>
>>
>>
>> *Tools / S3 library details :-*
>> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
>> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
>> aws-java-sdk-core-1.10.6.jar
>>
>> Regards,
>> Anup Tiwari
>>
>

Re: Too many S3 API calls for simple queries like select and create external table

Posted by Sungwoo Park <gl...@gmail.com>.
My understanding is that additional calls to S3 APi is the price to pay for
using the Hadoop library which only emulates FileSystem on top of S3. S3 is
not a distributed file system like HDFS, so some of the API calls cannot be
optimized in an ideal way.

For (i), a more serious problem is the cost of traversing the entire
directory, which is totally unnecessary. This was fixed in HIVE-24849:

https://issues.apache.org/jira/browse/HIVE-24849

You can find several JIRAs that try to reduce the overhead of calling S3 at
a higher level, e.g.:

https://issues.apache.org/jira/browse/HIVE-25277
https://issues.apache.org/jira/browse/HIVE-24546

We can also remove some of S3 calls with a technique described in
HIVE-24546. However, I think, unless some optimization is implemented at
the level of Hadoop, the overhead cannot be completely eliminated.

--- Sungwoo

On Fri, Feb 18, 2022 at 7:43 PM Anup Tiwari <an...@gmail.com> wrote:

> Hi Team,
>
> We are using Hive heavily for our ETL where our data is stored on S3 and
> so we have seen a strange behaviour between Hive & S3 interaction in terms
> of S3 API calls i.e. Actual number of API calls for simple select
> statements were too much compared to expected so let us know why it is
> behaving like this because if we execute the same select statement via
> Athena then the number of API calls are very less.
>
>
>
> *Background :-*
> We are incurring some S3 API cost and to understand each API call better,
> we decided to do simple testing.
>
> 1. We have a non partition table containing a lot of objects in parquet
> format on S3.
>
> 2. We copied one parquet file object(data) to a separate S3 bucket(target)
> so now our target bucket contains one parquet file data in following
> hierarchy on S3 :-
> s3://<BUCKET>/Test/000000_0   (Size of object : 1218 Bytes)
>
> 3. After that, we have executed following 3 command in Apache Hive 2.1.1
> managed by us on EC2 cluster :-
>
> (i) Create External table on top of above S3 location :-
>
> CREATE EXTERNAL TABLE `anup.Test`(
>   `id` int,
>   `cname` varchar(45),
>   `mef` decimal(10,3),
>   `mlpr` int,
>   `qperiod` int,
>   `validity` int,
>   `rpmult` decimal(10,3))
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://<BUCKET>/Test' ;
>
> (ii) msck repair table Test(Just to test behaviour) ;
> (iii) Simple select statement :- select * from anup.Test ;
>
>
>
>
> *Results :-*
> Ideally, we were *expecting max 5-10 API calls* with below breakdown
>
>
> 1. Create External : max 2-3 API calls ; which could be GET.BUCKET,
> HEAD.OBJECTS(to check if Test exists or not) and then maybe PUT.OBJECTS to
> create "Test/" object.
> 2. msck repair : 1-2 API calls ; since we have single object behind table
> 3. select *  : 1-2 API calls ; since we have single object behind table
>
>
> But *Actual number of Total API calls was 37* and we have fetched this
> from S3 Access Logs via Athena. Breakdown of these calls are as follows :-
>
> 1. Create External : 9 API calls
> 2. msck repair : 3 API calls
> 3. select *  : 25 API calls
>
>
> Attaching actual results of S3 Access Logs for select command along with
> DEBUG logs of Hive for select statement.
>
> Let us know why so many API calls are happening for the Create External /
> select statement because if we execute the same select statement *via
> Athena* then the number of API calls are very less i.e. *2*.
>
>
>
>
> *Tools / S3 library details :-*
> Apache Hive 2.1.1 / Apache Hadoop 2.8.0 / hadoop-aws-2.8.0.jar /
> aws-java-sdk-s3-1.10.6.jar / aws-java-sdk-kms-1.10.6.jar /
> aws-java-sdk-core-1.10.6.jar
>
> Regards,
> Anup Tiwari
>