Posted to issues@iceberg.apache.org by "hililiwei (via GitHub)" <gi...@apache.org> on 2023/04/28 10:33:16 UTC

[GitHub] [iceberg] hililiwei opened a new pull request, #7460: Spark 3.4: Time range rewrite data files

hililiwei opened a new pull request, #7460:
URL: https://github.com/apache/iceberg/pull/7460

   ## What is the purpose of the change
   Support rewriting only the data files that fall within a timestamp range or a snapshot ID range.
   
   ```
   actions().rewriteDataFiles(table)
            .startTimestamp(1682677842000L)
            .endTimestamp(1682677843000L)
            .execute();
   
   actions().rewriteDataFiles(table)
            .startSnapshotId(1)
            .endSnapshotId(2)
            .execute();
   
   
   CALL %s.system.rewrite_data_files(
           table => '%s', 
           options => map('start-timestamp','1682677842000', 'end-timestamp','1682677843000')
   )
   
   CALL %s.system.rewrite_data_files(
           table => '%s', 
           options => map('start-snapshot-id','1', 'end-snapshot-id','2')
   )
   
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org


[GitHub] [iceberg] RussellSpitzer commented on pull request #7460: Spark 3.4: Time range rewrite data files

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #7460:
URL: https://github.com/apache/iceberg/pull/7460#issuecomment-1569080324

   I guess that for certain files we only know that they were added before a certain point in time, if all we have in the metadata is an "existing" record for that file.
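
The distinction in the comment above can be sketched in plain Java. This is only an illustration that takes the comment's premise at face value: an "added" record gives the exact time a file was added, while an "existing" record only gives an upper bound. The class `AddedTimeBound`, its nested types, and its methods are hypothetical names for this sketch and are not part of Iceberg or of this PR.

```java
// Hypothetical sketch, not Iceberg code: models what a metadata record
// tells us about when a data file was added.
final class AddedTimeBound {
    enum Status { ADDED, EXISTING }

    /** What is known about a file's add time: a timestamp and whether it is exact. */
    static final class Bound {
        final long millis;
        final boolean exact;

        Bound(long millis, boolean exact) {
            this.millis = millis;
            this.exact = exact;
        }
    }

    /**
     * Assumption from the comment: an ADDED record pins the add time exactly,
     * an EXISTING record only says the file was added no later than that time.
     */
    static Bound addedTime(Status status, long recordSnapshotMillis) {
        return new Bound(recordSnapshotMillis, status == Status.ADDED);
    }

    /** An upper bound is enough to prove "added before cutoff". */
    static boolean definitelyAddedBefore(Bound bound, long cutoffMillis) {
        return bound.millis < cutoffMillis;
    }

    /** Proving "added at or after start" needs the exact add time. */
    static boolean definitelyAddedAtOrAfter(Bound bound, long startMillis) {
        return bound.exact && bound.millis >= startMillis;
    }

    public static void main(String[] args) {
        Bound upper = addedTime(Status.EXISTING, 1682677842000L);
        System.out.println(definitelyAddedBefore(upper, 1682677843000L));
        System.out.println(definitelyAddedAtOrAfter(upper, 1682677841000L));
    }
}
```

Under that premise, a "created before X" predicate can still be answered from an upper bound, whereas a range with a start point cannot.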




[GitHub] [iceberg] RussellSpitzer commented on pull request #7460: Spark 3.4: Time range rewrite data files

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #7460:
URL: https://github.com/apache/iceberg/pull/7460#issuecomment-1536588834

   I feel like snapshot start and end is the wrong way to go on this. Instead, do we have a way of just specifying a timestamp, i.e. only rewriting files created before timestamp X? I've been thinking about this as part of an extension of rewrite data files that enables writing predicates on file properties or metadata instead of data properties.
   
   Not sure if this is possible but I was wondering if we could support something like
   
   rewrite where file.created_at < some timepoint




[GitHub] [iceberg] RussellSpitzer commented on pull request #7460: Spark 3.4: Time range rewrite data files

Posted by "RussellSpitzer (via GitHub)" <gi...@apache.org>.
RussellSpitzer commented on PR #7460:
URL: https://github.com/apache/iceberg/pull/7460#issuecomment-1569079521

   Can't we determine when every file was added by looking at what snapshot they were added in?




[GitHub] [iceberg] aokolnychyi commented on pull request #7460: Spark 3.4: Time range rewrite data files

Posted by "aokolnychyi (via GitHub)" <gi...@apache.org>.
aokolnychyi commented on PR #7460:
URL: https://github.com/apache/iceberg/pull/7460#issuecomment-1527987733

   @hililiwei, out of curiosity, you mention it is needed for large tables. In use cases you have, what's the main problem? Is it the time to analyze all of the metadata or is it more about compacting only fresh data?




[GitHub] [iceberg] hililiwei commented on pull request #7460: Spark 3.4: Time range rewrite data files

Posted by "hililiwei (via GitHub)" <gi...@apache.org>.
hililiwei commented on PR #7460:
URL: https://github.com/apache/iceberg/pull/7460#issuecomment-1539307538

   > @hililiwei, out of curiosity, you mention it is needed for large tables. In use cases you have, what's the main problem? Is it the time to analyze all of the metadata or is it more about compacting only fresh data?
   
   It is more about compacting only fresh data. We have scheduled jobs that rewrite the newly added data.
   
   
   > I feel like snapshot start and end is the wrong way to go on this. Instead, do we have a way of just specifying a timestamp, i.e. only rewriting files created before timestamp X? I've been thinking about this as part of an extension of rewrite data files that enables writing predicates on file properties or metadata instead of data properties.
   > 
   > Not sure if this is possible but I was wondering if we could support something like
   > 
   > rewrite where file.created_at < some timepoint
   
   Our current use case also rewrites data files by timestamp. The usage is as follows:
   ```
   CALL %s.system.rewrite_data_files(
           table => '%s', 
           options => map('start-timestamp','1682677842000', 'end-timestamp','1682677843000')
   )
   
   CALL %s.system.rewrite_data_files(
           table => '%s', 
           options => map('end-timestamp','1682677843000')
   )
   ```
   
   > `rewrite where file.created_at < some timepoint`.
   
   This syntax is very intuitive for users, but we don't seem to keep the creation time of files in the metadata. Could using the snapshot time achieve the same effect?
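
To make the snapshot-time idea concrete, here is a minimal, self-contained Java sketch. It does not use the Iceberg API: `SnapshotTimeFilter`, `FileEntry`, and `select` are hypothetical names, the inclusive bounds are an assumption, and the map of snapshot ID to commit time stands in for whatever the table metadata provides. It treats the commit time of the snapshot that added a file as that file's creation time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Iceberg code: selects files by the commit time
// of the snapshot that added them.
final class SnapshotTimeFilter {
    /** Stand-in for a metadata record: a file path plus the ID of the snapshot that added it. */
    static final class FileEntry {
        final String path;
        final long addedSnapshotId;

        FileEntry(String path, long addedSnapshotId) {
            this.path = path;
            this.addedSnapshotId = addedSnapshotId;
        }
    }

    /**
     * Returns the paths of files whose adding snapshot was committed within
     * [startMillis, endMillis] (inclusive bounds are an assumption of this sketch).
     * Files whose snapshot has no known commit time are skipped.
     */
    static List<String> select(List<FileEntry> entries,
                               Map<Long, Long> snapshotCommitMillis,
                               long startMillis,
                               long endMillis) {
        List<String> selected = new ArrayList<>();
        for (FileEntry entry : entries) {
            Long committedAt = snapshotCommitMillis.get(entry.addedSnapshotId);
            if (committedAt != null && committedAt >= startMillis && committedAt <= endMillis) {
                selected.add(entry.path);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<FileEntry> entries = List.of(
            new FileEntry("a.parquet", 1L),
            new FileEntry("b.parquet", 2L));
        Map<Long, Long> commits = Map.of(1L, 1682677842000L, 2L, 1682677843000L);
        // Omitting the start bound, as in the second CALL above, can be modeled with Long.MIN_VALUE.
        System.out.println(select(entries, commits, Long.MIN_VALUE, 1682677842500L));
    }
}
```

One consequence of this approach is that a file whose adding snapshot can no longer be looked up cannot be placed in any range, which is why the sketch skips such files rather than guessing.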
   

