Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/11/11 00:11:56 UTC

[GitHub] [iceberg] dmgcodevil opened a new pull request, #6174: iss5675: limit total size of data files for compaction

dmgcodevil opened a new pull request, #6174:
URL: https://github.com/apache/iceberg/pull/6174

   Sometimes it is not possible to use a filter to limit the size of the files selected for compaction (for example, in generic data pipelines), or the total size per partition exceeds the JVM heap. With `totalSize`, a user can cap the total size of the input data files.
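
   A minimal sketch of how such a limit might be used, assuming a Spark job and the `RewriteDataFiles` action API; the option key `"total-size"` below is only a hypothetical placeholder for whatever name this PR introduces, and `spark`/`table` are assumed to be an existing SparkSession and Iceberg Table:

   ```
   import org.apache.iceberg.actions.RewriteDataFiles;
   import org.apache.iceberg.spark.actions.SparkActions;

   // Compact the table, but cap the combined size of the selected input data
   // files at roughly 10 GiB ("total-size" is a hypothetical key, not a released option).
   RewriteDataFiles.Result result =
       SparkActions.get(spark)
           .rewriteDataFiles(table)
           .option("total-size", String.valueOf(10L * 1024 * 1024 * 1024))
           .execute();
   ```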



[GitHub] [iceberg] dmgcodevil commented on pull request #6174: iss5675: limit total size of data files for compaction

Posted by "dmgcodevil (via GitHub)" <gi...@apache.org>.
dmgcodevil commented on PR #6174:
URL: https://github.com/apache/iceberg/pull/6174#issuecomment-1532123723

   @rdblue does this feature make sense to you?



[GitHub] [iceberg] dmgcodevil commented on pull request #6174: iss5675: limit total size of data files for compaction

Posted by GitBox <gi...@apache.org>.
dmgcodevil commented on PR #6174:
URL: https://github.com/apache/iceberg/pull/6174#issuecomment-1312545162

   I've noticed one thing: the `isPartialFileScan(task)` check is redundant because `task.files().size() > 1` is always true; see `filteredGroupedTasks`.



[GitHub] [iceberg] dmgcodevil commented on pull request #6174: iss5675: limit total size of data files for compaction

Posted by "dmgcodevil (via GitHub)" <gi...@apache.org>.
dmgcodevil commented on PR #6174:
URL: https://github.com/apache/iceberg/pull/6174#issuecomment-1532380590

   I've found the following option:
   
   ```
     /**
      * The entire rewrite operation is broken down into pieces based on partitioning and within partitions based
      * on size into groups. These sub-units of the rewrite are referred to as file groups. The largest amount of data that
      * should be compacted in a single group is controlled by {@link #MAX_FILE_GROUP_SIZE_BYTES}. This helps with
      * breaking down the rewriting of very large partitions which may not be rewritable otherwise due to the resource
      * constraints of the cluster. For example a sort based rewrite may not scale to terabyte sized partitions, those
      * partitions need to be worked on in small subsections to avoid exhaustion of resources.
      * <p>
      * When grouping files, the underlying rewrite strategy will use this value as to limit the files which
      * will be included in a single file group. A group will be processed by a single framework "action". For example,
      * in Spark this means that each group would be rewritten in its own Spark action. A group will never contain files
      * for multiple output partitions.
      */
     String MAX_FILE_GROUP_SIZE_BYTES = "max-file-group-size-bytes";
   ```
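   
   For reference, a minimal sketch of setting that existing option through the Spark rewrite action (assuming `spark` and `table` are an existing SparkSession and Iceberg Table; the 1 GiB value is just an example):
   
   ```
   import org.apache.iceberg.actions.RewriteDataFiles;
   import org.apache.iceberg.spark.actions.SparkActions;
   
   // Cap each file group at ~1 GiB so that no single rewrite "action" has to
   // process more data than the cluster can handle in one go.
   SparkActions.get(spark)
       .rewriteDataFiles(table)
       .option(RewriteDataFiles.MAX_FILE_GROUP_SIZE_BYTES, String.valueOf(1024L * 1024 * 1024))
       .execute();
   ```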
   
   However, would it make sense to limit the number of groups for compaction?
   
   cc/ @rdblue 



[GitHub] [iceberg] krvikash commented on pull request #6174: iss5675: limit total size of data files for compaction

Posted by GitBox <gi...@apache.org>.
krvikash commented on PR #6174:
URL: https://github.com/apache/iceberg/pull/6174#issuecomment-1311835852

   nit: I think all commits can be squashed into a single commit.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org