Posted to dev@iceberg.apache.org by raghavendra186 <ra...@gmail.com> on 2021/08/19 21:24:20 UTC

Iceberg - Remove orphan files issue with glue catalog running on EMR

Hi Guys,

I am working with Iceberg 0.11.1 and Spark 3.0.1. When I run
removeOrphanFiles, using either the Actions or the SparkActions class, it
works with a Hadoop catalog when run locally, but I get the exception below
when it runs on EMR with a Glue catalog. Could you please help me with
what I am missing here?

code snippet.

Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();

or

SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();

issue (when run on EMR):

21/08/19 08:12:56 INFO RemoveOrphanFilesMaintenanceJob: Running RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Status:Started, tenant: 1, table:raghu3.cars, removeOrphanFilesOlderThan: {1629360476572}.

21/08/19 08:12:56 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: raghu3.cars, removeOrphanFilesOlderThan: 1629360476572, Status: Failed, Reason: {}.

java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.raghu3.cars of type ALL_MANIFESTS
	at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:191)
	at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:121)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
	at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:101)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:274)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:117)
	at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:247)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)


The table does exist:

[image: image.png]

Has anyone faced this? What is the fix? Is it a bug, or am I missing
something here?

Thanks,
Raghu

Re: Iceberg - Remove orphan files issue with glue catalog running on EMR

Posted by Jack Ye <ye...@gmail.com>.
At first glance, this looks like the issue with Spark resolving a custom
catalog in 0.11, based on the code here:
https://github.com/apache/iceberg/blob/29cf712a821aa937e176f2d79a5593c4a1429e7f/spark/src/main/java/org/apache/iceberg/actions/BaseSparkAction.java#L138-L170

Could you provide more details of the stack trace beyond
BaseSparkAction.loadMetadataTable(BaseSparkAction.java:191)?

Also, that codebase has changed a lot since 0.11. I would recommend trying
the latest EMR Spark 3.1 version and the newly released Iceberg 0.12.0
to see if the problem persists.

Best,
Jack Ye


Re: Iceberg - Remove orphan files issue with glue catalog running on EMR

Posted by Jack Ye <ye...@gmail.com>.
Based on the stack trace, my thinking is:
1. For the first two use cases (Actions and SparkActions), it seems the
dynamic catalog resolution is not done properly for the Spark environment:
 - https://github.com/apache/iceberg/blob/80ff749b823098db82d0a8dc48c7e9db5ab3741b/spark/src/main/java/org/apache/iceberg/spark/SparkTableUtil.java#L618-L637
 - https://github.com/apache/iceberg/blob/80ff749b823098db82d0a8dc48c7e9db5ab3741b/spark3/src/main/java/org/apache/iceberg/spark/Spark3Util.java#L741-L757

Using a fully qualified table name might fix the issue, but
more context is needed for further debugging.

2. For the SQL procedure use case, I think catalog resolution succeeded,
but this specific procedure has to use Hadoop file system listing to
compare all the files under a directory against the files tracked by
Iceberg in order to remove orphan files; that is how "orphan" is defined
in this context. Your environment needs to be able to resolve the s3://
scheme in the Hadoop file system. Currently AWS EMR is the Spark vendor
that supports the s3:// scheme by default. This procedure also does not
work well right now for use cases like object storage mode, and is
oriented only toward Hive-structured tables.
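
To make the definition of "orphan" above concrete, here is a minimal sketch
in plain Java. It is not Iceberg's implementation: the class name
OrphanFileSketch, the method findOrphans, and the sample paths are made up
for illustration. It assumes the directory listing has already produced a
map from file path to last-modified time, and that the table metadata has
produced the set of referenced paths. In the real action, the directory
listing is the step that needs a Hadoop FileSystem for the s3:// scheme.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class OrphanFileSketch {

    /**
     * Returns the listed paths that the table does not reference and whose
     * last-modified time is strictly before olderThanMillis.
     * (Hypothetical helper, for illustration only.)
     */
    public static List<String> findOrphans(Map<String, Long> listedFiles,
                                           Set<String> referencedFiles,
                                           long olderThanMillis) {
        List<String> orphans = new ArrayList<>();
        for (Map.Entry<String, Long> entry : listedFiles.entrySet()) {
            boolean unreferenced = !referencedFiles.contains(entry.getKey());
            boolean oldEnough = entry.getValue() < olderThanMillis;
            if (unreferenced && oldEnough) {
                orphans.add(entry.getKey());
            }
        }
        return orphans;
    }

    public static void main(String[] args) {
        // Files found by listing the table location (path -> last-modified millis).
        Map<String, Long> listed = new LinkedHashMap<>();
        listed.put("s3://example-bucket/tbl/data/a.parquet", 100L);
        listed.put("s3://example-bucket/tbl/data/b.parquet", 150L);
        listed.put("s3://example-bucket/tbl/data/c.parquet", 900L);

        // Files the table metadata still points at.
        Set<String> referenced = new HashSet<>();
        referenced.add("s3://example-bucket/tbl/data/a.parquet");

        // b.parquet is unreferenced and old; c.parquet is unreferenced but too recent.
        System.out.println(findOrphans(listed, referenced, 500L));
        // prints [s3://example-bucket/tbl/data/b.parquet]
    }
}
```

The age cutoff matters because a file that was just written by an in-flight
commit is not yet referenced by the table, and deleting it would corrupt
that commit.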

btw, feel free to reach out through Slack if that's faster for debugging,
and we can summarize the results in the dev list.

Best,
Jack Ye
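
For readers hitting the same two errors, here is a hedged sketch of the
Spark properties involved. The catalog keys are the ones Iceberg's AWS
integration documents for registering a Glue-backed catalog. The fs.s3.impl
mapping to S3A is a common workaround outside EMR and is an assumption
here, not something confirmed in this thread; it needs hadoop-aws on the
classpath, and EMR should not need it. The class name, the helper method,
and the bucket name are hypothetical, and whether these settings resolve
the reported problem is not established.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class GlueCatalogConfSketch {

    /**
     * Builds the Spark properties that register an Iceberg catalog backed by
     * AWS Glue under the given name. (Hypothetical helper, for illustration.)
     */
    public static Map<String, String> glueCatalogConf(String catalogName, String warehouse) {
        String prefix = "spark.sql.catalog." + catalogName;
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put(prefix, "org.apache.iceberg.spark.SparkCatalog");
        conf.put(prefix + ".catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog");
        conf.put(prefix + ".io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
        conf.put(prefix + ".warehouse", warehouse);
        // Assumption: outside EMR, map the s3:// scheme to S3A so Hadoop
        // directory listing can resolve it. Do not set this on EMR.
        conf.put("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem");
        return conf;
    }

    public static void main(String[] args) {
        // Print the properties as spark-submit flags; the bucket is a placeholder.
        glueCatalogConf("glue_catalog", "s3://example-bucket/warehouse")
            .forEach((key, value) -> System.out.println("--conf " + key + "=" + value));
    }
}
```

With a catalog registered this way, table names passed to Spark can be
fully qualified as glue_catalog.<database>.<table>, which is the form
suggested in point 1 above.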




Re: Iceberg - Remove orphan files issue with glue catalog running on EMR

Posted by raghavendra186 <ra...@gmail.com>.
Hi,

RemoveOrphanFiles works only with Hadoop FS/IO, when run locally with a
Hadoop catalog. When I run it against S3 files using the Glue catalog on
EMR, it throws the error below. I have tried Iceberg 0.11 and 0.12 with
Spark 3.0.1 and Spark 3.1.1 (all combinations), and I have tried both the
Actions API and the Spark Actions API. The result does not change.

Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();

or

SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();

and the error is

21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: lakehouse_database.mobiletest1, removeOrphanFilesOlderThan: 1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
	at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
	at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
	at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
	at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
	at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
	at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.base/java.lang.reflect.Method.invoke(Method.java:566)
	at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)

I also tried the SQL version of remove orphan files and got the error below:

sparkSession.sql("CALL glue_catalog.lakehouse_database.remove_orphan_files(table => 'db.mobiletest1')").show();

and the error is

Exception in thread "main" org.apache.iceberg.exceptions.RuntimeIOException: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:236)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.buildActualFileDF(BaseDeleteOrphanFilesSparkAction.java:184)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:157)
	at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
	at com.salesforce.cdp.lakehouse.spark.tablemaintenance.TestWriter.main(TestWriter.java:133)
Caused by: org.apache.hadoop.fs.UnsupportedFileSystemException: No FileSystem for scheme "s3"
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:3281)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3301)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:124)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3352)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3320)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:479)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.listDirRecursively(BaseDeleteOrphanFilesSparkAction.java:214)

Please help me fix this problem. Is it something to do with my
implementation, or is it a bug in Iceberg?

Thanks,
Raghu
