You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@iceberg.apache.org by GitBox <gi...@apache.org> on 2021/06/02 14:11:05 UTC

[GitHub] [iceberg] nastra opened a new pull request #2664: Extract spark appid user

nastra opened a new pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rymurr commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649525353



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -137,7 +138,11 @@ protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
     if (pti.reference() != null) {
       newReference = loadReference(pti.reference());
     }
-    return new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    NessieTableOperations tableOperations =
+        new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    // TODO: is there a better way to pass the catalog options to the TableOperations than this?

Review comment:
       I think it was public because it was created before `CatalogUtil.loadCatalog`. No real reason for it to be public now.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr merged pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rymurr merged pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649020218



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -380,7 +381,12 @@ public boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException
   @Override
   public final void initialize(String name, CaseInsensitiveStringMap options) {
     this.cacheEnabled = Boolean.parseBoolean(options.getOrDefault("cache-enabled", "true"));
-    Catalog catalog = buildIcebergCatalog(name, options);
+
+    Map<String, String> map = new HashMap<>(options.asCaseSensitiveMap());
+    map.put("spark.app.id", SparkSession.active().sparkContext().applicationId());
+    map.put("spark.user", SparkSession.active().sparkContext().sparkUser());

Review comment:
       sgtm




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649270881



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -137,7 +138,11 @@ protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
     if (pti.reference() != null) {
       newReference = loadReference(pti.reference());
     }
-    return new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    NessieTableOperations tableOperations =
+        new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    // TODO: is there a better way to pass the catalog options to the TableOperations than this?

Review comment:
       for some reason I assumed that the signature of the constructor was defined and shouldn't be changed. Definitely better to pass this via the constructor




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649390999



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +92,26 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties(APPLICATION_TYPE, "iceberg");

Review comment:
       also currently I think it's fine if the property name is the same as on the Iceberg side, or do we need to have a clear distinction, wdyt @rymurr ?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r644048571



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       Assuming Spark is configured such as in https://github.com/projectnessie/nessie-demos/blob/main/pydemolib/nessiedemo/iceberg_spark.py#L67-L73 I believe we should end up in `SparkSessionCatalog` / `SparkCatalog` first (where those spark options are being set) and then those options are just being passed down to the respective `catalog-impl` class if I read the implementation correctly in `SparkCatalog#initialize(...)` -> `SparkCatalog#buildIcebergCatalog(...)` -> `CatalogUtil.buildIcebergCatalog(...)`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649377109



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -97,6 +97,8 @@ protected Catalog buildIcebergCatalog(String name, CaseInsensitiveStringMap opti
     Configuration conf = SparkSession.active().sessionState().newHadoopConf();
     Map<String, String> optionsMap = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
     optionsMap.putAll(options);
+    optionsMap.put("app-id", SparkSession.active().sparkContext().applicationId());
+    optionsMap.put("user", SparkSession.active().sparkContext().sparkUser());

Review comment:
       Should these keys be defined in `CatalogProperties`?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648771172



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       I agree with @rymurr. The properties passed as `options` in the Spark catalog API are the ones under the catalog name and not general options (unless Spark added this and I don't know about it).
   
   I think it would be better to ensure that the application ID is taken from the Spark context in the [`SparkCatalog` implementation](https://github.com/apache/iceberg/blob/master/spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java#L99). That has already called `SparkSession.active()` so you can get the ID from it directly and add it as an option to the property map. By doing it that way and using a common name, like `app-id`, to pass the app ID, we can set a convention that could also be used by Flink to set the app ID context.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

kbendick commented on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-853419812






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649385925



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +92,26 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties(APPLICATION_TYPE, "iceberg");

Review comment:
       yep that change was on purpose as I wanted those properties to be named the same way (using `-` vs `.`)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

kbendick commented on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-853420804


   > I believe that only a subset of that config is being passed down. Only the `spark.sql.catalog.catalog_name.*` options get passed through.
   
   Also if you have any thoughts on allowing passthrough of options to catalogs @rymurr, please feel free to comment on this issue: https://github.com/apache/iceberg/issues/2607
   
   The issue is specifically about hive options but seems like in general the Nessie catalog might have related concerns based on the above quote. cc @nastra as well if you have input.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649374411



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -137,7 +138,11 @@ protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
     if (pti.reference() != null) {
       newReference = loadReference(pti.reference());
     }
-    return new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    NessieTableOperations tableOperations =
+        new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    // TODO: is there a better way to pass the catalog options to the TableOperations than this?

Review comment:
       @rymurr, is there a reason why the constructor was public?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648768914



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java
##########
@@ -139,26 +131,11 @@ public FileIO io() {
     return fileIO;
   }
 
-  /**
-   * try and get a Spark application id if one exists.
-   *
-   * <p>
-   *   We haven't figured out a general way to pass commit messages through to the Nessie committer yet.
-   *   This is hacky but gets the job done until we can have a more complete commit/audit log.
-   * </p>
-   */
-  private String applicationId() {
-    String appId = null;
-    TableMetadata current = current();
-    if (current != null) {
-      Snapshot snapshot = current.currentSnapshot();
-      if (snapshot != null) {
-        Map<String, String> summary = snapshot.summary();
-        appId = summary.get("spark.app.id");
-      }
-
-    }
-    return appId;
+  Map<String, String> getCatalogOptions() {

Review comment:
       Iceberg rarely uses `get` in function names because it doesn't add anything. The only reason to use `get` is when the function acts as a getter _and_ the class should meet the expectations of a Java bean for some reason, like Spark interoperability. Otherwise, `get` is typically superfluous (as it is in this case) or there is almost certainly another verb that provides more context for _how_ the value is being produced.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick edited a comment on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

kbendick edited a comment on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-853419812


   This might wind up being related to https://github.com/apache/iceberg/issues/2607, so dropping a link for the reference back 🙂 . Thanks for your contributions @nastra!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649286097



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       sgtm, will move stuff to `buildIcebergCatalog`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649388083



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -97,6 +97,8 @@ protected Catalog buildIcebergCatalog(String name, CaseInsensitiveStringMap opti
     Configuration conf = SparkSession.active().sessionState().newHadoopConf();
     Map<String, String> optionsMap = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
     optionsMap.putAll(options);
+    optionsMap.put("app-id", SparkSession.active().sparkContext().applicationId());
+    optionsMap.put("user", SparkSession.active().sparkContext().sparkUser());

Review comment:
       I was hesitant to move them to a central place initially, but we'll probably will need those at least for Flink as well, so let's move them to `CatalogProperties`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648771747



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -137,7 +138,11 @@ protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
     if (pti.reference() != null) {
       newReference = loadReference(pti.reference());
     }
-    return new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    NessieTableOperations tableOperations =
+        new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    // TODO: is there a better way to pass the catalog options to the TableOperations than this?

Review comment:
       Add it to the constructor? The constructor is public, but I don't think there is a need for it to be public. You could create a package-private one that is called here instead. (Maybe @rymurr can also comment)




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648772510



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       Nevermind, it looks like this forwarding was already done in SparkCatalog below. I think that it's cleaner to do it in `buildIcebergCatalog`, though. That way you don't have to re-create the `CaseInsensitiveStringMap`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick edited a comment on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

kbendick edited a comment on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-853419812


   This might wind up being related to https://github.com/apache/iceberg/issues/2607, so dropping a link for the reference back 🙂 . Thanks for your contributions @nastra!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-855980177


   @kbendick thanks for mentioning https://github.com/apache/iceberg/issues/2607. I've read through it and I don't have any better suggestions than what was already mentioned on https://github.com/apache/iceberg/issues/2607.
   
   Re the similary of https://github.com/apache/iceberg/issues/2607 and https://github.com/apache/iceberg/issues/2664: both are kind of related, but different at the same time. https://github.com/apache/iceberg/issues/2607 is more about overriding hadoop config stuff, whereas https://github.com/apache/iceberg/issues/2664 is about extracting some info at some point and passing it down via the existing properties.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rymurr commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r644022277



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       Does this map actually have the `spark` config options? I thought it was just the subset of catalog options?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649376011



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +92,26 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties(APPLICATION_TYPE, "iceberg");

Review comment:
       Looks like this changes the property keys that are passed to Nessie. Is that okay?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-855982734


   @rdblue given that this approach came up in https://github.com/apache/iceberg/pull/1587/files#r520825564, would you also like to weigh in here?
   We were just discussing with @rymurr whether there's a chance to make the proposed approach from #2664 in a way that is more widely applicable. So far I haven't been able to come up with a better idea unfortunately.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rymurr commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r644291245



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       I believe that only a subset of that config is being passed down. Only the `spark.sql.catalog.catalog_name.*` options get passed through.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] kbendick commented on pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

kbendick commented on pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#issuecomment-853419812


   This might wind up being related to https://github.com/apache/iceberg/issues/2607, so dropping a link for the link back 🙂 . Thanks for your contributions @nastra!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649282810



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieTableOperations.java
##########
@@ -139,26 +131,11 @@ public FileIO io() {
     return fileIO;
   }
 
-  /**
-   * try and get a Spark application id if one exists.
-   *
-   * <p>
-   *   We haven't figured out a general way to pass commit messages through to the Nessie committer yet.
-   *   This is hacky but gets the job done until we can have a more complete commit/audit log.
-   * </p>
-   */
-  private String applicationId() {
-    String appId = null;
-    TableMetadata current = current();
-    if (current != null) {
-      Snapshot snapshot = current.currentSnapshot();
-      if (snapshot != null) {
-        Map<String, String> summary = snapshot.summary();
-        appId = summary.get("spark.app.id");
-      }
-
-    }
-    return appId;
+  Map<String, String> getCatalogOptions() {

Review comment:
       given that we're passing `catalogOptions` via the constructor, we can remove that method again




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rymurr commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rymurr commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649526852



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -137,7 +138,11 @@ protected TableOperations newTableOps(TableIdentifier tableIdentifier) {
     if (pti.reference() != null) {
       newReference = loadReference(pti.reference());
     }
-    return new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    NessieTableOperations tableOperations =
+        new NessieTableOperations(NessieUtil.toKey(pti.tableIdentifier()), newReference, client, fileIO);
+    // TODO: is there a better way to pass the catalog options to the TableOperations than this?

Review comment:
       ahhh excuse me, you meant the TableOps. That was just a mistake I suspect. Even less of a reason for that to be public.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648773010



##########
File path: spark3/src/main/java/org/apache/iceberg/spark/SparkCatalog.java
##########
@@ -380,7 +381,12 @@ public boolean dropNamespace(String[] namespace) throws NoSuchNamespaceException
   @Override
   public final void initialize(String name, CaseInsensitiveStringMap options) {
     this.cacheEnabled = Boolean.parseBoolean(options.getOrDefault("cache-enabled", "true"));
-    Catalog catalog = buildIcebergCatalog(name, options);
+
+    Map<String, String> map = new HashMap<>(options.asCaseSensitiveMap());
+    map.put("spark.app.id", SparkSession.active().sparkContext().applicationId());
+    map.put("spark.user", SparkSession.active().sparkContext().sparkUser());

Review comment:
       I think we should use generic property names here so that we could do the same thing from Flink or other similar catalog wrapper classes to add common context. Instead of `spark.user` we should maybe use just `user` (if it isn't set) and instead of `spark.app.id` use `app-id`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r646407274



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieCatalog.java
##########
@@ -80,12 +79,14 @@
   private UpdateableReference reference;
   private String name;
   private FileIO fileIO;
+  private Map<String, String> catalogOptions;
 
   public NessieCatalog() {
   }
 
   @Override
   public void initialize(String inputName, Map<String, String> options) {
+    this.catalogOptions = ImmutableMap.copyOf(options);

Review comment:
       I have an integration test with Spark in https://github.com/apache/iceberg/commit/dac4ab8b2dcb4faf4e974a0398dad0d4a4c84c82#diff-a69dddda28c4edb1f19f2f9e84de0e74a76fcf9e231b6975a72fab193b003733R146 that makes sure that the `spark.app.id` is set correctly. Unfortunately I can't commit those changes atm, because it  requires a few other changes which are also not yet committed to `master`. Given that, it seems those options are passed down properly to the `NessieCatalog`, unless I'm missing some logic where those things are explicitly filtered out and not passed further down?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] nastra commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

nastra commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649281347



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +90,25 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties("application.type", "iceberg");
+    if (catalogOptions.containsKey(SPARK_APP_ID)) {
+      cm.putProperties("spark.app.id", catalogOptions.get(SPARK_APP_ID));
+    }
+
+    return cm.build();
+  }
+
   /**
-   * @return The current OS user as defined by the system property <b>user.name</b> or an empty string
+   * @param catalogOptions The options where to look for the <b>spark.user</b>
+   * @return The author that can be used for a commit, which is either the <b>spark.user</b> from the given
+   * <code>catalogOptions</code> or the logged in user as defined in the <b>user.name</b> JVM properties.
    */
-  static String getCommitAuthor() {
-    return System.getProperty("user.name", "");
+  private static String getCommitAuthor(Map<String, String> catalogOptions) {

Review comment:
       you're right about returning an empty string being a red flag. Actually it's completely fine if the author can't be found and therefore is set to `null` in the `CommitMeta`. I'll remove returning an empty string and annotate `getCommitAuthor()` with `@Nullable`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r648769586



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +90,25 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties("application.type", "iceberg");
+    if (catalogOptions.containsKey(SPARK_APP_ID)) {
+      cm.putProperties("spark.app.id", catalogOptions.get(SPARK_APP_ID));
+    }
+
+    return cm.build();
+  }
+
   /**
-   * @return The current OS user as defined by the system property <b>user.name</b> or an empty string
+   * @param catalogOptions The options where to look for the <b>spark.user</b>
+   * @return The author that can be used for a commit, which is either the <b>spark.user</b> from the given
+   * <code>catalogOptions</code> or the logged in user as defined in the <b>user.name</b> JVM properties.
    */
-  static String getCommitAuthor() {
-    return System.getProperty("user.name", "");
+  private static String getCommitAuthor(Map<String, String> catalogOptions) {

Review comment:
       Why not return an `Option`? Returning an empty string is a red flag that you're producing bad data. If `.author` in `buildCommitMetadata` is required, then it would be better to put the empty string there and add a comment explaining that it is required by the Nessie API.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org

[GitHub] [iceberg] rdblue commented on a change in pull request #2664: Nessie: Extract Spark AppId / User from SparkContext and not from Snapshot

Posted by GitBox <gi...@apache.org>.

rdblue commented on a change in pull request #2664:
URL: https://github.com/apache/iceberg/pull/2664#discussion_r649376272



##########
File path: nessie/src/main/java/org/apache/iceberg/nessie/NessieUtil.java
##########
@@ -82,10 +92,26 @@ static ContentsKey toKey(TableIdentifier tableIdentifier) {
     return key;
   }
 
+  static CommitMeta buildCommitMetadata(String commitMsg, Map<String, String> catalogOptions) {
+    Preconditions.checkArgument(null != catalogOptions, "catalogOptions must not be null");
+    ImmutableCommitMeta.Builder cm = CommitMeta.builder().message(commitMsg)
+        .author(NessieUtil.getCommitAuthor(catalogOptions));
+    cm.putProperties(APPLICATION_TYPE, "iceberg");
+    if (catalogOptions.containsKey(APP_ID)) {
+      cm.putProperties(APP_ID, catalogOptions.get(APP_ID));
+    }
+
+    return cm.build();
+  }
+
   /**
-   * @return The current OS user as defined by the system property <b>user.name</b> or an empty string
+   * @param catalogOptions The options where to look for the <b>user</b>
+   * @return The author that can be used for a commit, which is either the <b>user</b> from the given
+   * <code>catalogOptions</code> or the logged in user as defined in the <b>user.name</b> JVM properties.
    */
-  static String getCommitAuthor() {
-    return System.getProperty("user.name", "");
+  @Nullable
+  private static String getCommitAuthor(Map<String, String> catalogOptions) {

Review comment:
       Nit: another useless `get` in a method name.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@iceberg.apache.org
For additional commands, e-mail: issues-help@iceberg.apache.org