Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/17 09:35:46 UTC

[GitHub] [iceberg] fengsen-neu opened a new issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

fengsen-neu opened a new issue #3909:
URL: https://github.com/apache/iceberg/issues/3909


   When we use the spark action rewriteDataFiles on a v2 table, equality_delete files are also merged with data files. Each equality_delete file is merged with every data file whose sequence number is less than its own, which causes frequent GC on the Spark executors and leads to 'connection reset by peer' and 'heartbeat timeout' errors during rewriteDataFiles.
   How can we limit JVM memory usage when rewriteDataFiles compacts equality_delete files?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] fengsen-neu commented on issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

Posted by GitBox <gi...@apache.org>.
fengsen-neu commented on issue #3909:
URL: https://github.com/apache/iceberg/issues/3909#issuecomment-1018232270


   > A large amount of memory is used because, for each data file, every key of each delete file with a higher seqNum is read into a HashSet for filtering, but only the keys that also appear in the data file actually need to be read. In the optimized version at my company, I use a Bloom filter of the data file's keys to filter out unnecessary eq-delete keys (a HashSet of the data file's keys also works, but a HashSet usually consumes more memory). If the data file format supports stored Bloom filters, for example Parquet based on #2642, it is even easier to read the Bloom filter directly from the data file. Maybe I can open a pull request if needed.
   
   ![image](https://user-images.githubusercontent.com/58256617/150480410-dffc1c5f-66db-41ab-ac7f-199e8f65542b.png)
   ![image](https://user-images.githubusercontent.com/58256617/150480438-bb672d08-94de-4f1e-86f0-1ee203dc9306.png)
   The DeleteFilter object that holds the delete file is what uses the most memory during compaction. As you said, 'each datafile will read all the data of the deletefile which seqNum is bigger than datafile in a hashSet for filtering', so I think we should control the DeleteFilter size to limit memory use.
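
   As a hedged illustration of the memory pattern described above (a conceptual sketch, not Iceberg's actual DeleteFilter code), the set-based approach keeps every applicable equality-delete key on the heap while a data file is filtered:

   ```python
   # Sketch: equality-delete filtering with an in-memory set.
   # All delete keys are materialized at once -- this set is the memory hot spot.

   def load_eq_delete_keys(delete_rows, key_cols):
       """Materialize every equality-delete key into a set of tuples."""
       return {tuple(row[c] for c in key_cols) for row in delete_rows}

   def filter_data_file(data_rows, delete_keys, key_cols):
       """Yield only rows whose key is not in the delete set."""
       for row in data_rows:
           if tuple(row[c] for c in key_cols) not in delete_keys:
               yield row

   deletes = [{"id": 2}, {"id": 4}]
   data = [{"id": i, "v": i * 10} for i in range(5)]
   keys = load_eq_delete_keys(deletes, ["id"])
   kept = list(filter_data_file(data, keys, ["id"]))
   # kept rows have ids 0, 1, 3
   ```

   The set grows with the total number of delete keys whose sequence numbers apply, regardless of how many of those keys actually occur in the data file being rewritten.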
   




[GitHub] [iceberg] fengsen-neu commented on issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

Posted by GitBox <gi...@apache.org>.
fengsen-neu commented on issue #3909:
URL: https://github.com/apache/iceberg/issues/3909#issuecomment-1020782807


   > A large amount of memory is used because, for each data file, every key of each delete file with a higher seqNum is read into a HashSet for filtering, but only the keys that also appear in the data file actually need to be read. In the optimized version at my company, I use a Bloom filter of the data file's keys to filter out unnecessary eq-delete keys (a HashSet of the data file's keys also works, but a HashSet usually consumes more memory). If the data file format supports stored Bloom filters, for example Parquet based on #2642, it is even easier to read the Bloom filter directly from the data file. Maybe I can open a pull request if needed.

   Regarding "It's easier to read bloom filters directly from a datafile. maybe I can pull a request if need" --- could you please provide the PR for my reference?




[GitHub] [iceberg] moon-fall edited a comment on issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

Posted by GitBox <gi...@apache.org>.
moon-fall edited a comment on issue #3909:
URL: https://github.com/apache/iceberg/issues/3909#issuecomment-1017062429


   A large amount of memory is used because, for each data file, every key of each delete file with a higher seqNum is read into a HashSet for filtering, but only the keys that also appear in the data file actually need to be read. In the optimized version at my company, I use a Bloom filter of the data file's keys to filter out unnecessary eq-delete keys (a HashSet of the data file's keys also works, but a HashSet usually consumes more memory). If the data file format supports stored Bloom filters, for example Parquet based on #2642, it is even easier to read the Bloom filter directly from the data file.
   Maybe I can open a pull request if needed.
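
   A minimal, stdlib-only sketch of the Bloom-filter idea described above; the `BloomFilter` class, its sizing parameters, and the example keys are all illustrative, not Iceberg code:

   ```python
   # Sketch: build a Bloom filter over a data file's keys, then keep only the
   # equality-delete keys the filter might contain. False positives are
   # possible (a few irrelevant delete keys may survive), but false negatives
   # are not, so correctness is preserved while memory use shrinks.
   import hashlib

   class BloomFilter:
       def __init__(self, size_bits=1024, num_hashes=3):
           self.size = size_bits
           self.k = num_hashes
           self.bits = 0  # big integer used as a bit array

       def _positions(self, item):
           # Derive k bit positions from salted SHA-256 digests.
           for i in range(self.k):
               digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
               yield int.from_bytes(digest[:8], "big") % self.size

       def add(self, item):
           for p in self._positions(item):
               self.bits |= 1 << p

       def might_contain(self, item):
           return all((self.bits >> p) & 1 for p in self._positions(item))

   data_keys = [1, 3, 5, 7]        # keys present in the data file
   delete_keys = [2, 3, 8]         # keys from the equality-delete file

   bf = BloomFilter()
   for k in data_keys:
       bf.add(k)

   # Only delete keys that might hit this data file are kept in memory.
   relevant = [k for k in delete_keys if bf.might_contain(k)]
   assert 3 in relevant  # a true hit is always retained
   ```

   The memory saving comes from the Bloom filter being a fixed-size bit array, while the HashSet of delete keys now only has to hold the (usually much smaller) subset of keys that can actually affect the data file.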




[GitHub] [iceberg] coolderli commented on issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

Posted by GitBox <gi...@apache.org>.
coolderli commented on issue #3909:
URL: https://github.com/apache/iceberg/issues/3909#issuecomment-1021785368


   This is also a headache for us. We use a RocksDB-backed set (see #2680), but it is hard to tune RocksDB. @moon-fall, could you please provide your PR? It matters to us.
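
   A hedged sketch of the disk-backed-set idea mentioned above; Python's stdlib `dbm` stands in for RocksDB here, and `DiskBackedSet` is a hypothetical name, not the #2680 implementation:

   ```python
   # Sketch: spill equality-delete keys to an on-disk key-value store so heap
   # usage stays bounded, trading memory for per-lookup disk I/O.
   import dbm
   import os
   import tempfile

   class DiskBackedSet:
       def __init__(self, path):
           self.db = dbm.open(path, "n")  # "n": always create a fresh database

       def add(self, key):
           # Store the key as bytes; the value is a placeholder.
           self.db[str(key).encode()] = b"1"

       def __contains__(self, key):
           return str(key).encode() in self.db

       def close(self):
           self.db.close()

   path = os.path.join(tempfile.mkdtemp(), "eq_delete_keys")
   deletes = DiskBackedSet(path)
   for k in [2, 4, 6]:
       deletes.add(k)
   assert 4 in deletes and 5 not in deletes
   deletes.close()
   ```

   The tuning pain mentioned in the comment is the usual trade-off of this design: lookups go through the storage engine's block cache and compaction settings rather than a plain in-heap hash table.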






[GitHub] [iceberg] moon-fall commented on issue #3909: When we use spark action rewriteDataFiles, how to limit equality_delete file compaction memory.

Posted by GitBox <gi...@apache.org>.
moon-fall commented on issue #3909:
URL: https://github.com/apache/iceberg/issues/3909#issuecomment-1017062429


   A large amount of memory is used because, for each data file, every key of each delete file with a higher seqNum is read into a HashSet for filtering. In the optimized version at my company, I use a Bloom filter of the data file's keys to filter out unnecessary eq-delete keys (a HashSet of the data file's keys also works, but a HashSet usually consumes more memory). If the data file format supports stored Bloom filters, for example Parquet based on #2642, it is even easier to read the Bloom filter directly from the data file.
   Maybe I can open a pull request if needed.

