Posted to dev@iceberg.apache.org by Greg Hill <gn...@paypal.com.INVALID> on 2021/07/07 20:27:03 UTC

GlueCatalog example?

Is there a Java example for the proper way to get the GlueCatalog object? We are trying to convert from HadoopTables and need access to the lower-level APIs to create and update tables with partitions.

I’m looking for something similar to these examples for HadoopTables and HiveCatalog: https://iceberg.apache.org/java-api-quickstart/

From what I can gather looking at the code, this is what I came up with (our catalog name is `iceberg`), but it feels like there’s probably a better way that I’m not seeing:

      this.icebergCatalog = new GlueCatalog();
      Configuration conf = spark.sparkContext().hadoopConfiguration();
      Map<String, String> props = ImmutableMap.of(
        "type", conf.get("spark.sql.catalog.iceberg.type"),
        "warehouse", conf.get("spark.sql.catalog.iceberg.warehouse"),
        "lock-impl", conf.get("spark.sql.catalog.iceberg.lock-impl"),
        "lock.table", conf.get("spark.sql.catalog.iceberg.lock.table"),
        "io-impl", conf.get("spark.sql.catalog.iceberg.io-impl")
      );
      this.icebergCatalog.initialize("iceberg", props);

Sorry for the potentially n00b question, but I’m a n00b 😃

Greg

Re: GlueCatalog example?

Posted by Greg Hill <gn...@paypal.com.INVALID>.
We can’t just turn off consistent view because we’d have to coordinate that change across a bunch of teams that run clusters on our platform. Since it’s only the Iceberg metadata files that become inconsistent, we’re trying to move the metadata out of S3 by using the Glue catalog, but the data files still have to be on EMRFS for the time being.

I’ll check out the code you sent, thanks!


Re: GlueCatalog example?

Posted by Jack Ye <ye...@gmail.com>.
Yes, this is by design. In Iceberg, all of the table metadata is
governed in the file system, so you always have a table metadata JSON
file that is the root of your current table version. The catalog only
serves the purpose of storing the pointer to this root file and
guaranteeing atomicity when there is a new table version and the root
location needs to be updated. Some organizations use a customized
implementation of TableMetadata to store the table metadata not in a
file but in some storage layer like a database, but even then it is not
the job of the catalog to store this information. I would not recommend
using Glue as storage for this metadata, because Glue is read-optimized
storage for only lightweight table metadata, whereas Iceberg's table
metadata grows linearly with the number of snapshots.
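
As a concrete illustration (a hypothetical sketch using the AWS SDK v2
Glue client, not code from this thread; database and table names are
placeholders), the pointer can be read straight off the Glue table:

    import software.amazon.awssdk.services.glue.GlueClient;
    import software.amazon.awssdk.services.glue.model.GetTableRequest;

    public class MetadataPointer {
      public static String metadataLocation(String database, String table) {
        try (GlueClient glue = GlueClient.create()) {
          // Glue stores only this pointer; the full table state lives in
          // the metadata JSON file it points to.
          return glue.getTable(GetTableRequest.builder()
                  .databaseName(database)
                  .name(table)
                  .build())
              .table()
              .parameters()
              .get("metadata_location");
        }
      }
    }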

-Jack Ye

On Fri, Jul 9, 2021 at 8:39 AM Greg Hill <gn...@paypal.com.invalid> wrote:

> Ok, setting the hadoopConf before initialize seems to have gotten past
> that hump. I’m debugging an issue with my tests where it appears that I’m
> not cleaning up glue in between test runs but I am cleaning up the
> filesystem so the metadata files no longer exist and it fails. But why is
> it using metadata files at all with the Glue catalog? The
> “metadata_location” on the Glue table paramaters is set to a file path. Is
> that how this is supposed to work? Does it just remove the need for the
> version-hint.text file but not the metadata.json files?
>
>
>
> Greg
>
>
>
>
>
> *From: *Jack Ye <ye...@gmail.com>
> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>
> *Date: *Thursday, July 8, 2021 at 12:06 PM
> *To: *Iceberg Dev List <de...@iceberg.apache.org>
> *Subject: *Re: GlueCatalog example?
>
>
>
> This message was identified as a phishing scam.
>
> I think you need to first call setConf and then initialize, mimicking the
> logic in
> https://github.com/apache/iceberg/blob/6bcca16c48cd92dc98640130a28f73431e99e336/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L189-L191which
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Ficeberg%2Fblob%2F6bcca16c48cd92dc98640130a28f73431e99e336%2Fcore%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Ficeberg%2FCatalogUtil.java%23L189-L191which&data=04%7C01%7Cgnhill%40paypal.com%7Cecf5a300b87540d1bc8f08d94232b77e%7Cfb00791460204374977e21bac5f3f4c8%7C0%7C0%7C637613607864000826%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=bh5giMrRgrY%2F%2FfSt3wI18rhAaGWRsfKsznaUELMPveM%3D&reserved=0>
> is used by all engines to initialize catalogs. You might be able to
> directly leverage the CatalogUtil.buildIcebergCatalog instead of writing
> your customized logic.
>
>
>
> With that being said, I remember we had this conversation in another
> thread and did not continue with it, EMRFS consistent view is now
> unnecessary as S3 is now strongly consistent. I am not sure if there is any
> additional benefit you would like to gain by continuing to use EMRFS.
>
>
>
> -Jack Ye
>
>
>
> On Thu, Jul 8, 2021 at 8:11 AM Greg Hill <gn...@paypal.com.invalid>
> wrote:
>
> Thanks! Seems I wasn’t too far off then. It’s my understanding that
> because we’re using EMRFS consistent view, we should not use S3FileIO or
> the emrfs metadata will get out of sync, but it doesn’t seem like this
> catalog works with HadoopFileIO so far in my basic testing. I get a
> NullPointerException because the Hadoop configuration isn’t passed along at
> some point.
>
>
>
> I noticed that I needed to call `setConf()` to get the Hadoop configs into
> the catalog object.
>
>
>
>       Map<String, String> props = ImmutableMap.of(
>
>         "type", "iceberg",
>
>         "warehouse", config.getOutputDir(),
>
>         "lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager",
>
>         "lock.table", config.getDynamoIcebergLocksTable(),
>
>         "io-impl", "org.apache.iceberg.hadoop.HadoopFileIO"
>
>       );
>
>       this.icebergCatalog.initialize("iceberg", props);
>
>
>
>
> this.icebergCatalog.setConf(spark.sparkContext().hadoopConfiguration());
>
>
>
> Then when I call createTable later:
>
>
>
> java.lang.NullPointerException
>
>                 at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:481)
>
>                 at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
>
>                 at org.apache.iceberg.hadoop.Util.getFs(Util.java:48)
>
>                 at
> org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
>
>                 at
> org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:64)
>
>                 at
> org.apache.iceberg.BaseMetastoreTableOperations.writeNewMetadata(BaseMetastoreTableOperations.java:137)
>
>                 at
> org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:105)
>
>                 at
> org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:118)
>
>                 at
> org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:215)
>
>                 at
> org.apache.iceberg.BaseMetastoreCatalog.createTable(BaseMetastoreCatalog.java:48)
>
>                 at
> org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:105)
>
>
>
> The NPE is because `conf` is null in that method, but I verified that
> icebergCatalog.hadoopConf is the expected object.
>
>
>
> Should it be expected that the GlueCatalog can be used with HadoopFileIO
> or is it only compatible with S3FileIO?
>
>
>
> Greg
>
>
>
>
>
> *From: *Jack Ye <ye...@gmail.com>
> *Reply-To: *"dev@iceberg.apache.org" <de...@iceberg.apache.org>
> *Date: *Wednesday, July 7, 2021 at 4:16 PM
> *To: *Iceberg Dev List <de...@iceberg.apache.org>
> *Subject: *Re: GlueCatalog example?
>
>
>
> This message was identified as a phishing scam.
>
> Yeah this is actually a good point, the documentation is mostly around
> loading the catalog to different SQL engines and lacks Java API examples.
> The integration tests are good places to see Java examples:
> https://github.com/apache/iceberg/blob/master/aws/src/integration/java/org/apache/iceberg/aws/glue/GlueTestBase.java
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Ficeberg%2Fblob%2Fmaster%2Faws%2Fsrc%2Fintegration%2Fjava%2Forg%2Fapache%2Ficeberg%2Faws%2Fglue%2FGlueTestBase.java&data=04%7C01%7Cgnhill%40paypal.com%7Cfc99f00ca0854b626e7208d9418c8c49%7Cfb00791460204374977e21bac5f3f4c8%7C0%7C0%7C637612894168256361%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=dV9Uvdbm4ogsuvADlri%2FuWt2xAuBVA56%2BI8%2Bj3mRs1Y%3D&reserved=0>
>
>
>
> -Jack Ye
>
>
>
> On Wed, Jul 7, 2021 at 1:27 PM Greg Hill <gn...@paypal.com.invalid>
> wrote:
>
> Is there a Java example for the proper way to get the GlueCatalog object?
> We are trying to convert from HadoopTables and need access to the
> lower-level APIs to create and update tables with partitions.
>
>
>
> I’m looking for something similar to these examples for HadoopTables and
> HiveCatalog: https://iceberg.apache.org/java-api-quickstart/
> <https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Ficeberg.apache.org%2Fjava-api-quickstart%2F&data=04%7C01%7Cgnhill%40paypal.com%7Cfc99f00ca0854b626e7208d9418c8c49%7Cfb00791460204374977e21bac5f3f4c8%7C0%7C0%7C637612894168266327%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=luUUropvT0UFzgyVtGjmdosqyf%2BFpRpM3oL0Pnu9tK8%3D&reserved=0>
>
>
>
> From what I can gather looking at the code, this is what I came up with
> (our catalog name is `iceberg`), but it feels like there’s probably a
> better way that I’m not seeing:
>
>
>
>       this.icebergCatalog = new GlueCatalog();
>
>       Configuration conf = spark.sparkContext().hadoopConfiguration();
>
>       Map<String, String> props = ImmutableMap.of(
>
>         "type", conf.get("spark.sql.catalog.iceberg.type"),
>
>         "warehouse", conf.get("spark.sql.catalog.iceberg.warehouse"),
>
>         "lock-impl", conf.get("spark.sql.catalog.iceberg.lock-impl"),
>
>         "lock.table", conf.get("spark.sql.catalog.iceberg.lock.table"),
>
>         "io-impl", conf.get("spark.sql.catalog.iceberg.io-impl")
>
>       );
>
>       this.icebergCatalog.initialize("iceberg", props);
>
>
>
> Sorry for the potentially n00b question, but I’m a n00b 😃
>
>
>
> Greg
>
>

Re: GlueCatalog example?

Posted by Greg Hill <gn...@paypal.com.INVALID>.
Ok, setting the hadoopConf before initialize seems to have gotten past that hump. I’m debugging an issue with my tests: it appears I’m not cleaning up Glue between test runs, but I am cleaning up the filesystem, so the metadata files no longer exist and it fails. But why is it using metadata files at all with the Glue catalog? The “metadata_location” in the Glue table parameters is set to a file path. Is that how this is supposed to work? Does it just remove the need for the version-hint.text file but not the metadata.json files?

Greg


Re: GlueCatalog example?

Posted by Jack Ye <ye...@gmail.com>.
I think you need to first call setConf and then initialize, mimicking the
logic in
https://github.com/apache/iceberg/blob/6bcca16c48cd92dc98640130a28f73431e99e336/core/src/main/java/org/apache/iceberg/CatalogUtil.java#L189-L191
which is used by all engines to initialize catalogs. You might be able to
directly leverage CatalogUtil.buildIcebergCatalog instead of writing your
own customized logic.
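
For example, a minimal sketch of both approaches (warehouse path and
property values are placeholders, and this assumes an Iceberg version where
buildIcebergCatalog takes a Hadoop Configuration as its last argument):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.CatalogUtil;
    import org.apache.iceberg.aws.glue.GlueCatalog;
    import org.apache.iceberg.catalog.Catalog;

    public class GlueCatalogInit {
      // Manual construction: setConf before initialize, mirroring CatalogUtil.
      public static Catalog buildManually(Configuration conf) {
        GlueCatalog glue = new GlueCatalog();
        glue.setConf(conf);
        glue.initialize("iceberg", baseProps());
        return glue;
      }

      // Alternatively, let CatalogUtil handle the ordering by naming the
      // implementation class via catalog-impl.
      public static Catalog buildViaCatalogUtil(Configuration conf) {
        Map<String, String> props = baseProps();
        props.put("catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog");
        return CatalogUtil.buildIcebergCatalog("iceberg", props, conf);
      }

      private static Map<String, String> baseProps() {
        Map<String, String> props = new HashMap<>();
        props.put("warehouse", "s3://my-bucket/warehouse"); // placeholder
        props.put("io-impl", "org.apache.iceberg.hadoop.HadoopFileIO");
        return props;
      }
    }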

With that being said, I remember we had this conversation in another thread
and did not continue with it: EMRFS consistent view is now unnecessary, as
S3 is now strongly consistent. I am not sure there is any additional
benefit you would gain by continuing to use EMRFS.

-Jack Ye

Re: GlueCatalog example?

Posted by Greg Hill <gn...@paypal.com.INVALID>.
Thanks! Seems I wasn’t too far off then. It’s my understanding that because we’re using EMRFS consistent view, we should not use S3FileIO or the EMRFS metadata will get out of sync, but this catalog doesn’t seem to work with HadoopFileIO so far in my basic testing. I get a NullPointerException because the Hadoop configuration isn’t passed along at some point.

I noticed that I needed to call `setConf()` to get the Hadoop configs into the catalog object.

      this.icebergCatalog = new GlueCatalog();
      Map<String, String> props = ImmutableMap.of(
        "type", "iceberg",
        "warehouse", config.getOutputDir(),
        "lock-impl", "org.apache.iceberg.aws.glue.DynamoLockManager",
        "lock.table", config.getDynamoIcebergLocksTable(),
        "io-impl", "org.apache.iceberg.hadoop.HadoopFileIO"
      );
      this.icebergCatalog.initialize("iceberg", props);

      // setConf only happens after initialize here
      this.icebergCatalog.setConf(spark.sparkContext().hadoopConfiguration());

Then when I call createTable later:

java.lang.NullPointerException
                at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:481)
                at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
                at org.apache.iceberg.hadoop.Util.getFs(Util.java:48)
                at org.apache.iceberg.hadoop.HadoopOutputFile.fromPath(HadoopOutputFile.java:53)
                at org.apache.iceberg.hadoop.HadoopFileIO.newOutputFile(HadoopFileIO.java:64)
                at org.apache.iceberg.BaseMetastoreTableOperations.writeNewMetadata(BaseMetastoreTableOperations.java:137)
                at org.apache.iceberg.aws.glue.GlueTableOperations.doCommit(GlueTableOperations.java:105)
                at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:118)
                at org.apache.iceberg.BaseMetastoreCatalog$BaseMetastoreCatalogTableBuilder.create(BaseMetastoreCatalog.java:215)
                at org.apache.iceberg.BaseMetastoreCatalog.createTable(BaseMetastoreCatalog.java:48)
                at org.apache.iceberg.catalog.Catalog.createTable(Catalog.java:105)

The NPE is because `conf` is null in that method, but I verified that icebergCatalog.hadoopConf is the expected object.

Should it be expected that the GlueCatalog can be used with HadoopFileIO or is it only compatible with S3FileIO?

Greg


Re: GlueCatalog example?

Posted by Jack Ye <ye...@gmail.com>.
Yeah, this is actually a good point: the documentation is mostly about
loading the catalog into different SQL engines and lacks Java API examples.
The integration tests are a good place to see Java examples:
https://github.com/apache/iceberg/blob/master/aws/src/integration/java/org/apache/iceberg/aws/glue/GlueTestBase.java
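
In the meantime, a rough end-to-end sketch (warehouse path, database, and
table names are placeholders; pick the io-impl that fits your setup):

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.PartitionSpec;
    import org.apache.iceberg.Schema;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.aws.glue.GlueCatalog;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.types.Types;

    public class GlueQuickstart {
      public static void main(String[] args) {
        GlueCatalog catalog = new GlueCatalog();
        catalog.setConf(new Configuration()); // Hadoop conf before initialize

        Map<String, String> props = new HashMap<>();
        props.put("warehouse", "s3://my-bucket/warehouse"); // placeholder
        props.put("io-impl", "org.apache.iceberg.aws.s3.S3FileIO");
        catalog.initialize("glue", props);

        // An id column plus a timestamp column, partitioned by day(ts).
        Schema schema = new Schema(
            Types.NestedField.required(1, "id", Types.LongType.get()),
            Types.NestedField.optional(2, "ts", Types.TimestampType.withZone()));
        PartitionSpec spec = PartitionSpec.builderFor(schema).day("ts").build();

        Table table = catalog.createTable(
            TableIdentifier.of("mydb", "events"), schema, spec);
        System.out.println("created table at " + table.location());
      }
    }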

-Jack Ye
