Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2020/12/15 17:35:40 UTC
[GitHub] [iceberg] cccs-jc commented on issue #1931: expire snapshot does not remove snap-*.avro files
cccs-jc commented on issue #1931:
URL: https://github.com/apache/iceberg/issues/1931#issuecomment-745447983
Iceberg 0.9.0
Spark 3.0.0
Steps to reproduce:
```
from pyspark.sql import functions as F
from pyspark.sql.types import TimestampType

# Assumes an active SparkSession named `spark` with an Iceberg catalog named `iceberg`.


def createDataFrame(startid, now, numRows, numFiles):
    # The ids are spread over numFiles Spark partitions.
    df = spark.range(start=startid, end=numRows + startid, numPartitions=numFiles)
    df1 = df.select(
        df.id,
        (now + (df.id * INCREMENT_PER_FILE)).cast(TimestampType()).alias('loadedby'),
        (now + (df.id * INCREMENT_PER_FILE) - (5 * INCREMENT_PER_FILE)).cast(TimestampType()).alias('eventtime'),
        F.expr('concat(uuid())').alias('data'),
    )
    return df1


now = 1607043600
FILES_PER_HOUR = 500
SECONDS_PER_HOUR = 60 * 60
INCREMENT_PER_FILE = SECONDS_PER_HOUR / FILES_PER_HOUR
# LOADS_PER_HOUR was not defined in the original snippet; 20 is an assumption
# that matches the 21 snapshots reported below (1 CTAS + 20 overwrites).
LOADS_PER_HOUR = 20
NUM_ROWS = FILES_PER_HOUR
NUM_FILES = FILES_PER_HOUR
startid = 0
df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
df.createOrReplaceTempView('TMP_TABLE')

# create initial table
spark.sql("""
    CREATE OR REPLACE TABLE iceberg.test.danglingmetadata
    USING iceberg
    PARTITIONED BY (hours(loadedby), hours(eventtime))
    TBLPROPERTIES (
        'write.metadata.delete-after-commit.enabled'='true',
        'write.metadata.previous-versions-max'='1'
    )
    AS SELECT * FROM TMP_TABLE
""")

# replace partitions with insert overwrite
startid = startid + NUM_ROWS
NUMBER_OF_HOURS = 1
NUMBER_OF_LOADS = NUMBER_OF_HOURS * LOADS_PER_HOUR
# Integer division: spark.range needs integer bounds and partition counts.
NUM_ROWS = FILES_PER_HOUR // LOADS_PER_HOUR
NUM_FILES = FILES_PER_HOUR // LOADS_PER_HOUR
for i in range(NUMBER_OF_LOADS):
    print(startid)
    df = createDataFrame(startid, now, NUM_ROWS, NUM_FILES)
    df.createOrReplaceTempView('TMP_TABLE')
    spark.sql('INSERT OVERWRITE iceberg.test.danglingmetadata SELECT * FROM TMP_TABLE')
    startid = startid + NUM_ROWS
```
```
spark.table(f'iceberg.test.danglingmetadata.snapshots').count()
```
21 snapshots
```
spark.table(f'iceberg.test.danglingmetadata.manifests').count()
```
2 manifests
Listing the metadata directory shows 65 files in total, 21 of which are snap-*.avro files.
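The tally above can be reproduced with a small helper. This is only a sketch: `count_metadata_files` is a hypothetical name, it assumes you already have a listing of the metadata directory (from a local or mounted filesystem, or from any object-store client), and it relies only on the `snap-*.avro` naming mentioned in this issue.

```python
import fnmatch
import os


def count_metadata_files(names):
    """Return (total, manifest_lists) for an iterable of file names or paths."""
    base_names = [os.path.basename(n) for n in names]
    # Manifest-list files are the ones named snap-*.avro.
    manifest_lists = [n for n in base_names if fnmatch.fnmatch(n, "snap-*.avro")]
    return len(base_names), len(manifest_lists)


# Hypothetical usage against a local copy of the table's metadata directory:
# total, snaps = count_metadata_files(os.listdir("/path/to/table/metadata"))
```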
Running expire snapshots (for everything up to the last minute):
```
table.expireSnapshots().expireOlderThan(tsToExpire).commit()
```
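`tsToExpire` is not shown above. `expireOlderThan` takes a timestamp in milliseconds since the epoch, so a cutoff of one minute ago could be computed roughly like this (the helper name `expire_cutoff_millis` is hypothetical):

```python
import time


def expire_cutoff_millis(age_seconds=60, now_seconds=None):
    """Epoch milliseconds for a cutoff `age_seconds` in the past."""
    if now_seconds is None:
        now_seconds = time.time()
    return int((now_seconds - age_seconds) * 1000)


# Snapshots older than one minute become eligible for expiration.
tsToExpire = expire_cutoff_millis()
```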
In my example table, the expire Action removed the snapshot and manifest files.
Listing the metadata directory now shows 6 files in total, 1 of which is a snap-*.avro file.
I'm not sure how the other table we have got into a "bad" state; maybe some operations failed. When a table is in that state, expiring snapshots does not remove unreferenced metadata files. That is reasonable, but it would be useful to have an Action that specifically removes "dangling" metadata files.
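As a rough idea of what such an Action would have to compute: the snap-*.avro files still in use are the ones listed in the `manifest_list` column of the table's `snapshots` metadata table, and any other snap-*.avro file in the metadata directory is a candidate for removal. The sketch below is not an existing Iceberg API. `find_dangling_manifest_lists` is a hypothetical helper, it compares by file name so that differing URI schemes do not matter, and it only reports candidates without deleting anything.

```python
import os


def find_dangling_manifest_lists(listed_paths, referenced_paths):
    """Return the listed snap-*.avro paths that no snapshot references."""
    referenced = {os.path.basename(p) for p in referenced_paths}
    dangling = []
    for path in listed_paths:
        name = os.path.basename(path)
        is_manifest_list = name.startswith("snap-") and name.endswith(".avro")
        if is_manifest_list and name not in referenced:
            dangling.append(path)
    return sorted(dangling)


# Hypothetical usage (assumes a `spark` session and a `listing` of the metadata directory):
# rows = spark.table('iceberg.test.danglingmetadata.snapshots').select('manifest_list').collect()
# referenced = [r.manifest_list for r in rows]
# print(find_dangling_manifest_lists(listing, referenced))
```

A real implementation would also need to skip files written by commits that are still in progress, for example by only considering files older than some threshold.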