You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/01/30 05:44:16 UTC

[GitHub] [iceberg] kbendick opened a new pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

kbendick opened a new pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008


   This closes https://github.com/apache/iceberg/issues/4007
   
   As of Iceberg 0.13.0, the Spark stored procedures `expire_snapshots` and `remove_orphan_files` have an added parameter, `max_concurrent_deletes`, which indicates the size of the thread pool / executor service that should be instantiated to remove the relevant files in parallel.
   
   Without this parameter, no separate thread pool is instantiated and the files are deleted sequentially in the current thread which can be slow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795151795



##########
File path: site/docs/spark-procedures.md
##########
@@ -221,12 +222,13 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 
 #### Usage
 
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table`       | ✔️  | string | Name of the table to clean |
-| `older_than`  | ️   | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
-| `location`    |    | string    | Directory to look for files in (defaults to the table's location) |
-| `dry_run`     |    | boolean   | When true, don't actually remove files (defaults to false) |
+| Argument Name | Required? | Type      | Description                                                                                                                                                |
+|--------|-----------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

Review comment:
       Oh I can add it back in though. My bad!
   
   EDIT - I think it actually got extended, due to the larger length of the last column. Let me take a look.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r796050552



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       Agreed it's a bit long. Will update.
   
   If we wanted to expand on the details, we could add an additional sentence below like `older_than` and `retain_last` are. But I think the shorter statement is sufficient.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#issuecomment-1026329359


   Thanks, @kbendick!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795443467



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       Might be not accurate to say no thread pool at all is used, wonder if we still need to specify 'by default, no dedicated thread pool is used'.  (Not sure if that's too academic).




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795151795



##########
File path: site/docs/spark-procedures.md
##########
@@ -221,12 +222,13 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 
 #### Usage
 
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table`       | ✔️  | string | Name of the table to clean |
-| `older_than`  | ️   | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
-| `location`    |    | string    | Directory to look for files in (defaults to the table's location) |
-| `dry_run`     |    | boolean   | When true, don't actually remove files (defaults to false) |
+| Argument Name | Required? | Type      | Description                                                                                                                                                |
+|--------|-----------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

Review comment:
       Oh I can add it back in though. My bad!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r796163009



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       This is updated.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795155222



##########
File path: site/docs/spark-procedures.md
##########
@@ -221,12 +222,13 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 
 #### Usage
 
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table`       | ✔️  | string | Name of the table to clean |
-| `older_than`  | ️   | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
-| `location`    |    | string    | Directory to look for files in (defaults to the table's location) |
-| `dry_run`     |    | boolean   | When true, don't actually remove files (defaults to false) |
+| Argument Name | Required? | Type      | Description                                                                                                                                                |
+|--------|-----------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

Review comment:
       Hmmm..... looking at it closer though, it does kind of appear like maybe it was trying to be lined up at one point? I don't know what the standards are here, but probably it was just another IntelliJ user like me that had it formatted the way it was.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r796050552



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       Agreed it's a bit long. Will update.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
rdblue commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795219794



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       The description is a bit long. How about "by default, no threadpool is used"




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] rdblue merged pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
rdblue merged pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] dramaticlly commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
dramaticlly commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795246026



##########
File path: site/docs/spark-procedures.md
##########
@@ -190,6 +190,7 @@ the `expire_snapshots` procedure will never remove files which are still require
 | `table`       | ✔️  | string | Name of the table to update |
 | `older_than`  | ️   | timestamp | Timestamp before which snapshots will be removed (Default: 5 days ago) |
 | `retain_last` |    | int       | Number of ancestor snapshots to preserve regardless of `older_than` (defaults to 1) |
+| `max_concurrent_deletes` |    | int       | Size of the thread pool used for delete file actions (defaults to null, which deletes files serially in the current thread without instantiating a dedicated thread pool) |

Review comment:
       seconded, reads better to me personally
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#issuecomment-1025075845


   cc @RussellSpitzer @szehon-ho @dramaticlly @samredai  @jackye1995 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] szehon-ho commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
szehon-ho commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795149569



##########
File path: site/docs/spark-procedures.md
##########
@@ -221,12 +222,13 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 
 #### Usage
 
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table`       | ✔️  | string | Name of the table to clean |
-| `older_than`  | ️   | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
-| `location`    |    | string    | Directory to look for files in (defaults to the table's location) |
-| `dry_run`     |    | boolean   | When true, don't actually remove files (defaults to false) |
+| Argument Name | Required? | Type      | Description                                                                                                                                                |
+|--------|-----------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

Review comment:
       Seems some unnecessary ```---``` got removed, though I guess as it's MD table it's not a big deal




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] kbendick commented on a change in pull request #4008: [Site] - Document max_concurrent_deletes parameter in spark stored procedures.

Posted by GitBox <gi...@apache.org>.
kbendick commented on a change in pull request #4008:
URL: https://github.com/apache/iceberg/pull/4008#discussion_r795155222



##########
File path: site/docs/spark-procedures.md
##########
@@ -221,12 +222,13 @@ Used to remove files which are not referenced in any metadata files of an Iceber
 
 #### Usage
 
-| Argument Name | Required? | Type | Description |
-|---------------|-----------|------|-------------|
-| `table`       | ✔️  | string | Name of the table to clean |
-| `older_than`  | ️   | timestamp | Remove orphan files created before this timestamp (Defaults to 3 days ago) |
-| `location`    |    | string    | Directory to look for files in (defaults to the table's location) |
-| `dry_run`     |    | boolean   | When true, don't actually remove files (defaults to false) |
+| Argument Name | Required? | Type      | Description                                                                                                                                                |
+|--------|-----------|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------|

Review comment:
       Hmmm..... looking at it closer though, it does kind of appear like maybe it was trying to be lined up at one point? I don't know what the standards are here, but probably it was just another IntelliJ user like me that had it partially formatted as a pretty printed table. I've reverted that.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org