Posted to dev@hive.apache.org by Attila Magyar <am...@hortonworks.com> on 2019/10/31 11:16:11 UTC

Review Request 71707: Performance degradation on single row inserts

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/
-----------------------------------------------------------

Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.


Bugs: HIVE-22411
    https://issues.apache.org/jira/browse/HIVE-22411


Repository: hive-git


Description
-------

Executing single-row insert statements on a transactional table degrades write performance on an S3 file system. Each insert creates a new delta directory. After each insert, Hive calculates statistics such as the number of files in the table and the total size of the table. To calculate these, it traverses the table directory recursively, executing a separate listStatus call for each path. As a result, the more delta directories the table has, the longer it takes to calculate the statistics.

Therefore, insertion time grows linearly with the number of delta directories.
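
The cost described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the code from the patch: the class ListingCostSketch, its in-memory "bucket", and the delta_N/bucket_00000 key names are hypothetical stand-ins. In Hadoop terms, listChildren plays the role of FileSystem.listStatus(Path) called once per directory, and listFlat plays the role of a single recursive FileSystem.listFiles(Path, true) on a file system that can answer it with one flat prefix listing.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy object store (hypothetical, not a Hadoop type) that counts list requests. */
class ListingCostSketch {
    // Sorted map of full object key -> size in bytes, standing in for a bucket.
    private final TreeMap<String, Long> objects = new TreeMap<>();
    int listCalls = 0;

    void put(String key, long size) {
        objects.put(key, size);
    }

    /** One list request: immediate children of dir; sub-directories end with '/'. */
    TreeSet<String> listChildren(String dir) {
        listCalls++;
        TreeSet<String> children = new TreeSet<>();
        for (String key : objects.subMap(dir, dir + Character.MAX_VALUE).keySet()) {
            int slash = key.indexOf('/', dir.length());
            children.add(slash < 0 ? key : key.substring(0, slash + 1));
        }
        return children;
    }

    /** One list request: every object under the prefix, however deeply nested. */
    Map<String, Long> listFlat(String prefix) {
        listCalls++;
        return objects.subMap(prefix, prefix + Character.MAX_VALUE);
    }

    /** Tree walk: one list request per directory. Returns {fileCount, totalSize}. */
    long[] statsByTreeWalk(String dir) {
        long[] stats = new long[2];
        for (String child : listChildren(dir)) {
            if (child.endsWith("/")) {
                long[] sub = statsByTreeWalk(child);
                stats[0] += sub[0];
                stats[1] += sub[1];
            } else {
                stats[0]++;
                stats[1] += objects.get(child);
            }
        }
        return stats;
    }

    /** Flat listing: a single list request. Returns {fileCount, totalSize}. */
    long[] statsByFlatListing(String prefix) {
        long[] stats = new long[2];
        for (long size : listFlat(prefix).values()) {
            stats[0]++;
            stats[1] += size;
        }
        return stats;
    }

    public static void main(String[] args) {
        ListingCostSketch fs = new ListingCostSketch();
        // 100 single-row inserts, each leaving one file in its own delta directory.
        for (int i = 1; i <= 100; i++) {
            fs.put("warehouse/t/delta_" + i + "/bucket_00000", 10);
        }
        long[] walked = fs.statsByTreeWalk("warehouse/t/");
        int walkCalls = fs.listCalls;
        fs.listCalls = 0;
        long[] flat = fs.statsByFlatListing("warehouse/t/");
        System.out.println("tree walk:    " + walked[0] + " files, " + walked[1]
            + " bytes, " + walkCalls + " list calls");
        System.out.println("flat listing: " + flat[0] + " files, " + flat[1]
            + " bytes, " + fs.listCalls + " list calls");
    }
}
```

With 100 delta directories the tree walk issues 101 list requests (one for the table directory plus one per delta), while the flat listing issues one. Both report the same file count and total size, so only the number of round trips grows with the number of inserts.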


Diffs
-----

  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java 38e843aeacf 
  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java 155ecb18bf5 


Diff: https://reviews.apache.org/r/71707/diff/1/


Testing
-------

Measured and plotted insertion time.


Thanks,

Attila Magyar


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Attila Magyar <am...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218524
-----------------------------------------------------------




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 331 (original), 324 (patched)
<https://reviews.apache.org/r/71707/#comment306265>

    BlobStorageUtils::isBlobStorageFileSystem() checks whether the scheme is "s3", "s3n", or "s3a", but only S3AFileSystem has the optimized listFiles(). NativeS3FileSystem does not override the tree-walking algorithm from the base class.
    
    See: https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3861
    
    and:
    
    https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java
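
The distinction drawn in this comment can be sketched as two separate predicates. This is a hedged illustration, not the patched code: the class ListingStrategySketch and both method names are hypothetical, and the scheme set is simply the three schemes quoted above rather than anything read from Hive configuration.

```java
import java.net.URI;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper contrasting "is blob storage" with "has a fast recursive listing". */
class ListingStrategySketch {
    // The schemes the comment above says isBlobStorageFileSystem() matches.
    static final Set<String> BLOB_SCHEMES = Set.of("s3", "s3n", "s3a");

    /** True for any of the blob storage schemes listed above. */
    static boolean isBlobStorageScheme(URI location) {
        String scheme = location.getScheme();
        return scheme != null && BLOB_SCHEMES.contains(scheme.toLowerCase(Locale.ROOT));
    }

    /** True only for s3a, the one scheme the comment says has an optimized listFiles(). */
    static boolean hasOptimizedRecursiveListing(URI location) {
        return "s3a".equalsIgnoreCase(location.getScheme());
    }

    public static void main(String[] args) {
        for (String s : new String[] {"s3a://bucket/t", "s3n://bucket/t", "hdfs://nn/t"}) {
            URI uri = URI.create(s);
            System.out.println(s + " blob=" + isBlobStorageScheme(uri)
                + " optimized=" + hasOptimizedRecursiveListing(uri));
        }
    }
}
```

An s3n location passes the first check but not the second, which is why choosing the listing strategy with the broader blob storage check would send s3n down a path that brings it no benefit.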


- Attila Magyar


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Attila Magyar <am...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/
-----------------------------------------------------------

(Updated Nov. 7, 2019, 9:23 a.m.)


Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.


Changes
-------

Addressing review comments


Bugs: HIVE-22411
    https://issues.apache.org/jira/browse/HIVE-22411


Repository: hive-git


Description
-------

Executing single-row insert statements on a transactional table degrades write performance on an S3 file system. Each insert creates a new delta directory. After each insert, Hive calculates statistics such as the number of files in the table and the total size of the table. To calculate these, it traverses the table directory recursively, executing a separate listStatus call for each path. As a result, the more delta directories the table has, the longer it takes to calculate the statistics.

Therefore, insertion time grows linearly with the number of delta directories.


Diffs (updated)
-----

  common/src/java/org/apache/hadoop/hive/common/FileUtils.java 651b842f688 
  common/src/java/org/apache/hadoop/hive/common/HiveStatsUtils.java 09343e56166 
  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java 38e843aeacf 
  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java bf206fffc26 


Diff: https://reviews.apache.org/r/71707/diff/3/

Changes: https://reviews.apache.org/r/71707/diff/2-3/


Testing
-------

Measured and plotted insertion time.


Thanks,

Attila Magyar


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Panos Garefalakis via Review Board <no...@reviews.apache.org>.

> On Nov. 5, 2019, 4:33 p.m., Panos Garefalakis wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Lines 328 (patched)
> > <https://reviews.apache.org/r/71707/diff/2/?file=2171542#file2171542line335>
> >
> >     Hey Attila, the solution looks good. However, since other file systems (e.g. Azure Blob storage) might face similar issues in the future when using this recursive method, wouldn't it make sense to have HDFS as the base case and handle the others separately? And maybe log a warning here when the file system is not supported?
> 
> Attila Magyar wrote:
>     Hey Panos, I checked the Hadoop project and found only one FS implementation with an optimized recursive listFiles(); the other implementations use the tree-walking implementation from the base class. I think that's the more common case. Do you know where the source of this Azure Blob storage file system is? Is it open source at all?

Hey Attila, I was referring to this: https://hadoop.apache.org/docs/current/hadoop-azure/index.html
but I was also assuming that the recursive method you modified would be called for other file systems as well. If that's not the case, then my comment does not apply :)


- Panos


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
-----------------------------------------------------------


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Attila Magyar <am...@hortonworks.com>.

> On Nov. 5, 2019, 4:33 p.m., Panos Garefalakis wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Lines 328 (patched)
> > <https://reviews.apache.org/r/71707/diff/2/?file=2171542#file2171542line335>
> >
> >     Hey Attila, the solution looks good. However, since other file systems (e.g. Azure Blob storage) might face similar issues in the future when using this recursive method, wouldn't it make sense to have HDFS as the base case and handle the others separately? And maybe log a warning here when the file system is not supported?

Hey Panos, I checked the Hadoop project and found only one FS implementation with an optimized recursive listFiles(); the other implementations use the tree-walking implementation from the base class. I think that's the more common case. Do you know where the source of this Azure Blob storage file system is? Is it open source at all?


- Attila


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
-----------------------------------------------------------


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Panos Garefalakis via Review Board <no...@reviews.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218505
-----------------------------------------------------------




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Lines 328 (patched)
<https://reviews.apache.org/r/71707/#comment306253>

    Hey Attila, the solution looks good. However, since other file systems (e.g. Azure Blob storage) might face similar issues in the future when using this recursive method, wouldn't it make sense to have HDFS as the base case and handle the others separately? And maybe log a warning here when the file system is not supported?


- Panos Garefalakis


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Attila Magyar <am...@hortonworks.com>.

> On Nov. 5, 2019, 11:59 p.m., Ashutosh Chauhan wrote:
> > standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
> > Line 331 (original), 324 (patched)
> > <https://reviews.apache.org/r/71707/diff/2/?file=2171542#file2171542line331>
> >
> >     You may use BlobStorageUtils::isBlobStorageFileSystem() here.

isBlobStorageFileSystem matches s3, s3a, and s3n, but only S3AFileSystem (https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java#L3861) has an optimized listFiles() implementation.

NativeS3FileSystem (https://github.com/apache/hadoop/blob/1d5d7d0989e9ee2f4527dc47ba5c80e1c38f641a/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3native/NativeS3FileSystem.java) uses the same tree-traversing algorithm from the base class.


- Attila


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218518
-----------------------------------------------------------


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Ashutosh Chauhan <ha...@apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218518
-----------------------------------------------------------




standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 323 (original), 321 (patched)
<https://reviews.apache.org/r/71707/#comment306261>

    Can you please also make a similar change to common/src/java/org/apache/hadoop/hive/common/FileUtils.java::listStatusRecursively() so that method also benefits from this change?



standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Line 331 (original), 324 (patched)
<https://reviews.apache.org/r/71707/#comment306259>

    You may use BlobStorageUtils::isBlobStorageFileSystem() here.



standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java
Lines 378 (patched)
<https://reviews.apache.org/r/71707/#comment306260>

    Use BlobStorageUtils::isBlobStorageFileSystem() here instead.


- Ashutosh Chauhan


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Attila Magyar <am...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/
-----------------------------------------------------------

(Updated Nov. 5, 2019, 3:32 p.m.)


Review request for hive, Ashutosh Chauhan, Peter Vary, and Slim Bouguerra.


Changes
-------

Addressing Ashutosh's comments


Bugs: HIVE-22411
    https://issues.apache.org/jira/browse/HIVE-22411


Repository: hive-git


Description
-------

Executing single-row insert statements on a transactional table degrades write performance on an S3 file system. Each insert creates a new delta directory. After each insert, Hive calculates statistics such as the number of files in the table and the total size of the table. To calculate these, it traverses the table directory recursively, executing a separate listStatus call for each path. As a result, the more delta directories the table has, the longer it takes to calculate the statistics.

Therefore, insertion time grows linearly with the number of delta directories.


Diffs (updated)
-----

  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/Warehouse.java 38e843aeacf 
  standalone-metastore/metastore-common/src/main/java/org/apache/hadoop/hive/metastore/utils/FileUtils.java bf206fffc26 


Diff: https://reviews.apache.org/r/71707/diff/2/

Changes: https://reviews.apache.org/r/71707/diff/1-2/


Testing
-------

Measured and plotted insertion time.


Thanks,

Attila Magyar


Re: Review Request 71707: Performance degradation on single row inserts

Posted by Slim Bouguerra <sb...@hortonworks.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71707/#review218479
-----------------------------------------------------------



Looked at the code; looks good to me.

- Slim Bouguerra

