Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/15 17:35:40 UTC

[GitHub] [iceberg] cccs-jc commented on issue #1931: expire snapshot does not remove snap-*.avro files

cccs-jc commented on issue #1931:
URL: https://github.com/apache/iceberg/issues/1931#issuecomment-745447983


   
   
   Iceberg 0.9.0
   Spark 3.0.0
   
   Steps to reproduce:
   
   ```
   
   from pyspark.sql import functions as F
   from pyspark.sql.types import TimestampType
   
   def createDataFrame(startid, now, numRows, numFiles):
       # one row per id, spread over numFiles partitions so each partition writes one file
       df = spark.range(start=startid, end=numRows + startid, numPartitions=numFiles)
       df1 = df.select(
           df.id,
           (now + (df.id * INCREMENT_PER_FILE)).cast(TimestampType()).alias('loadedby'),
           (now + (df.id * INCREMENT_PER_FILE) - (5 * INCREMENT_PER_FILE)).cast(TimestampType()).alias('eventtime'),
           F.expr('concat(uuid())').alias('data')
       )
       return df1
   
   now = 1607043600
   FILES_PER_HOUR = 500
   LOADS_PER_HOUR = 20   # assumed; not defined in the original snippet (20 loads plus the initial CTAS matches the 21 snapshots below)
   SECONDS_PER_HOUR = 60 * 60
   INCREMENT_PER_FILE = SECONDS_PER_HOUR / FILES_PER_HOUR
   
   NUM_ROWS = FILES_PER_HOUR
   NUM_FILES = FILES_PER_HOUR
   startid = 0
   df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
   df.createOrReplaceTempView('TMP_TABLE')
   
   # create the initial table
   spark.sql("""
       CREATE OR REPLACE TABLE iceberg.test.danglingmetadata
       USING iceberg
       PARTITIONED BY (hours(loadedby), hours(eventtime))
       TBLPROPERTIES (
           'write.metadata.delete-after-commit.enabled'='true',
           'write.metadata.previous-versions-max'='1'
       )
       AS SELECT * FROM TMP_TABLE
   """)
   
   # replace partitions with INSERT OVERWRITE
   startid = startid + NUM_ROWS
   NUMBER_OF_HOURS = 1
   NUMBER_OF_LOADS = NUMBER_OF_HOURS * LOADS_PER_HOUR
   for i in range(0, NUMBER_OF_LOADS):
       print(startid)
       NUM_ROWS = FILES_PER_HOUR // LOADS_PER_HOUR    # integer division keeps row/file counts as ints
       NUM_FILES = FILES_PER_HOUR // LOADS_PER_HOUR
       df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
       df.createOrReplaceTempView('TMP_TABLE')
       spark.sql('INSERT OVERWRITE iceberg.test.danglingmetadata SELECT * FROM TMP_TABLE')
       startid = startid + NUM_ROWS
   
   ```
   
   ```
   spark.table(f'iceberg.test.danglingmetadata.snapshots').count()
   ```
   21 snapshots
   
   ```
   spark.table(f'iceberg.test.danglingmetadata.manifests').count()
   ```
   2 manifests
   
   Listing the metadata directory shows 65 files in total, 21 of which are snap-*.avro files (the manifest lists).
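   
   For reference, a minimal sketch of such a listing (this assumes a local-filesystem warehouse at a hypothetical /warehouse/test/danglingmetadata path; on HDFS or S3 the equivalent filesystem listing applies):
   
   ```
   from pathlib import Path
   
   # hypothetical warehouse location; substitute the table's actual metadata path
   metadata_dir = Path('/warehouse/test/danglingmetadata/metadata')
   
   all_files = [p for p in metadata_dir.iterdir() if p.is_file()]
   snap_files = [p for p in all_files if p.name.startswith('snap-') and p.name.endswith('.avro')]
   
   print(f'{len(all_files)} files in total, {len(snap_files)} snap-*.avro files')
   ```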
   
   Running expire snapshots (expiring everything up to the last minute):
   ```
   table.expireSnapshots().expireOlderThan(tsToExpire).commit()
   ```
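   
   For completeness, `table` and `tsToExpire` are not shown above. A rough sketch of one way to obtain them from PySpark through the JVM gateway (this assumes a path-based HadoopTables table at the same hypothetical warehouse location; a catalog-managed table would be loaded through its catalog instead). Note that expireOlderThan() takes milliseconds since the epoch:
   
   ```
   import time
   
   # hypothetical: load the table via the Iceberg Java API exposed through py4j
   hadoop_conf = spark._jsc.hadoopConfiguration()
   tables = spark._jvm.org.apache.iceberg.hadoop.HadoopTables(hadoop_conf)
   table = tables.load('/warehouse/test/danglingmetadata')
   
   # expire everything older than one minute ago (milliseconds since epoch)
   tsToExpire = int(time.time() * 1000) - 60 * 1000
   
   table.expireSnapshots().expireOlderThan(tsToExpire).commit()
   ```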
   
   In my example table, the expire action removed the snapshot and manifest files as expected.
   
   Listing the metadata directory now shows 6 files in total, 1 of which is a snap-*.avro file.
   
   I'm not sure how the other table we have got into a "bad" state; maybe there were some failed operations. When a table is in that state, expiring snapshots does not remove the unreferenced metadata files. That is reasonable, but it would be useful to have an Action that specifically removes such "dangling" metadata files.
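   
   Until such an action exists, a rough sketch of how one might spot the dangling files is to diff the metadata directory against the manifest_list paths of the snapshots that are still live (this reuses the hypothetical local metadata path from above; actually deleting what it finds should be done with care):
   
   ```
   import os
   from pathlib import Path
   
   # hypothetical warehouse location, same as in the listing above
   metadata_dir = Path('/warehouse/test/danglingmetadata/metadata')
   
   # manifest lists still referenced by live snapshots
   referenced = {
       os.path.basename(row.manifest_list)
       for row in spark.table('iceberg.test.danglingmetadata.snapshots').collect()
   }
   
   # snap-*.avro files actually present in the metadata directory
   on_disk = {
       p.name for p in metadata_dir.iterdir()
       if p.name.startswith('snap-') and p.name.endswith('.avro')
   }
   
   dangling = sorted(on_disk - referenced)
   print(f'{len(dangling)} dangling manifest-list files: {dangling}')
   ```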
   
   
   

