Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2022/05/11 15:55:28 UTC

[GitHub] [iceberg] RussellSpitzer commented on pull request #4674: Spark-3.2: Avoid duplicate computation of ALL_MANIFESTS metadata table for spark actions

RussellSpitzer commented on PR #4674:
URL: https://github.com/apache/iceberg/pull/4674#issuecomment-1123959424

   > @RussellSpitzer, @szehon-ho, @aokolnychyi: PR is ready for review.
   > 
   > a) I tested with the local file system in a local IDE test case, reading 1000 manifest list files: without cache ~22.6 s, with cache ~18.5 s. The difference is small, probably because it is a local file system. Since the PR avoids re-reading the manifest list files (verified with logs), a larger difference would likely show on an S3 file system with a huge number of snapshots.
   > 
   > ```
   > @Test
   >   public void testExpireSnapshotUsingNamedArgs() {
   >     sql("CREATE TABLE %s (id bigint NOT NULL, data string) USING iceberg", tableName);
   > 
   >     for (int i = 0; i < 1000; i++) {
   >       sql("INSERT INTO TABLE %s VALUES (1, 'a')", tableName);
   >     }
   > 
   >     Table table = validationCatalog.loadTable(tableIdent);
   > 
   >     Assert.assertEquals("Should be 1000 snapshots", 1000, Iterables.size(table.snapshots()));
   > 
   >     waitUntilAfter(table.currentSnapshot().timestampMillis());
   > 
   >     Timestamp currentTimestamp = Timestamp.from(Instant.ofEpochMilli(System.currentTimeMillis()));
   > 
   >     long time = System.currentTimeMillis();
   > 
   >     List<Object[]> output = sql(
   >         "CALL %s.system.expire_snapshots(" +
   >             "older_than => TIMESTAMP '%s'," +
   >             "table => '%s'," +
   >             "retain_last => 999, use_caching => false)",
   >         catalogName, currentTimestamp, tableIdent);
   > 
   >     System.out.println("time taken in ms: " + (System.currentTimeMillis() - time));
   >   }
   > 
   >   time taken in ms: 18478  -- with cache
   >   time taken in ms: 22589  -- without cache
   > ```
   > 
   > **b) The base code already uses caching for `RewriteManifestsProcedure`, so I followed the same pattern and syntax (keeping caching as the default).** I also found a bug in the base code and raised PR #4722.
   > 
   > c) I have more plans for improving `expire_snapshots` (a different idea than #3457); I will work on them in a subsequent PR.
   > 
   > The idea: add a new "from_snapshot_id" column while preparing the actual-files Dataset, then filter the expired snapshot IDs out of the persisted output with a NOT IN filter (without scanning again), and finally apply the same `df.except()` logic to find the expired files.
   
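The flow proposed in (c) — tag each file row with the snapshot it came from, drop rows whose snapshot is expired, then take the set difference — can be sketched with plain Java collections. This is a hypothetical analogue for illustration only; the actual implementation would operate on Spark Datasets, and the `FileRef` type and method names here are invented.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExpiredFilesSketch {
  // Hypothetical row shape: a file path plus the proposed "from_snapshot_id" column.
  record FileRef(String path, long fromSnapshotId) {}

  static Set<String> expiredFiles(List<FileRef> allFiles, Set<Long> expiredSnapshotIds) {
    Set<String> all = new HashSet<>();
    Set<String> stillValid = new HashSet<>();
    for (FileRef f : allFiles) {
      all.add(f.path());
      // The "NOT IN" filter: keep only rows from snapshots that are not expired.
      if (!expiredSnapshotIds.contains(f.fromSnapshotId())) {
        stillValid.add(f.path());
      }
    }
    // The df.except() step: files reachable only from expired snapshots.
    all.removeAll(stillValid);
    return all;
  }

  public static void main(String[] args) {
    List<FileRef> files = List.of(
        new FileRef("a.parquet", 1L),
        new FileRef("b.parquet", 2L),
        new FileRef("c.parquet", 1L),
        new FileRef("c.parquet", 2L)); // c.parquet also reachable from snapshot 2
    // Expiring snapshot 1: a.parquet expires; c.parquet survives via snapshot 2.
    System.out.println(expiredFiles(files, Set.of(1L))); // prints [a.parquet]
  }
}
```

The key property the sketch shows is that a file is only expired when no surviving snapshot still references it, which is exactly what `except()` on the two Datasets would compute.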
   The benchmarks really need to run on datasets that take minutes if we want to see the difference. The key would be to set up a test with thousands or tens of thousands of manifests.
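For intuition, the benefit `use_caching` targets is compute-once, reuse-many: the ALL_MANIFESTS scan result is reused instead of recomputed for each downstream use. A plain-Java analogue of that trade-off (hypothetical, not the PR's Spark code, where the equivalent would be `Dataset.cache()`):

```java
import java.util.function.Supplier;

public class CachedScan {
  // Memoizing supplier: the expensive computation runs once and the result is
  // reused, analogous to caching a Dataset that an action reads more than once.
  static <T> Supplier<T> memoize(Supplier<T> delegate) {
    return new Supplier<T>() {
      private T value;
      private boolean computed;

      @Override
      public synchronized T get() {
        if (!computed) {
          value = delegate.get(); // the expensive "scan" runs here, once
          computed = true;
        }
        return value;
      }
    };
  }

  public static void main(String[] args) {
    int[] scans = {0};
    Supplier<Long> manifests = memoize(() -> { scans[0]++; return 1000L; });
    manifests.get();
    manifests.get(); // served from the cached value, no second scan
    System.out.println("scans=" + scans[0]); // prints scans=1
  }
}
```

On a local file system both paths are fast, so the saving is small; the more expensive each "scan" is (e.g. S3 round trips over thousands of manifests), the more the single computed pass pays off, which matches the point about needing minutes-long benchmarks to see the gap.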


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

