You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@gobblin.apache.org by GitBox <gi...@apache.org> on 2021/01/05 02:04:02 UTC
[GitHub] [incubator-gobblin] aplex commented on a change in pull request #3178: [GOBBLIN-1339] Add cluster name to dataset descriptor

aplex commented on a change in pull request #3178:
URL: https://github.com/apache/incubator-gobblin/pull/3178#discussion_r551674731



##########
File path: gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
 
 package org.apache.gobblin.dataset;
 
-import java.util.Map;
-
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Maps;
-
 import lombok.Getter;
 
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
 
 /**
  * A {@link Descriptor} identifies and provides metadata to describe a dataset
  */
 public class DatasetDescriptor extends Descriptor {
   private static final String PLATFORM_KEY = "platform";
   private static final String NAME_KEY = "name";
+  private static final String CLUSTER_NAME_KEY = "clusterName";
 
   /**
    * which platform the dataset is stored, for example: local, hdfs, oracle, mysql, kafka
    */
   @Getter
   private final String platform;
 
+  /**
+   * Human-readeable cluster name.
+   *
+   * @see org.apache.gobblin.util.ClustersNames
+   */
+  @Getter
+  @Nullable
+  private final String clusterName;

Review comment:
       Initially, I was thinking about putting a well-known cluster name like "snap" into this field, and for unknown clusters put a DNS name like "datalake.corp.com". Then lineage events will include this name as-is, and it can be displayed to users directly.  We already use "cluster name" concept in org.apache.gobblin.util.ClustersNames, for getting nice names from yarn/hdfs urls.
   
   However, this well-know name is not easily convertible back to domain name. Single well-known name can map to different urls for hdfs/hive. So, there could be a benefit in keeping a hostname instead of clustername, and converting it to human-readable name before reporting lineage. 
   
   DatasetDescriptor already looks like a dataset URL with non-standard field names:
   - platform is like url.protocol - "hdfs", "kafka", "mysql"
   - name is like url.path - "table1" or "/data/tracking/test123"
   - clusterName is like a url.host - "my-cluster" or "my-cluster.company.com"
   
   Regarding the fields you suggested:
   - I don't think we have availability zone information anywhere.
   - We can get well-known nice name from hostname using org.apache.gobblin.util.ClustersNames , so don't have to store it separately.
   - Keeping port information is discussable. I would assume that any production deployments will have separate DNS names for different clusters. However, for small scale, temporary deployments it's possible to have multiple Hives or namenodes on a single host, running on different ports.
   
   

##########
File path: gobblin-api/src/main/java/org/apache/gobblin/dataset/DatasetDescriptor.java
##########
@@ -17,34 +17,52 @@
 
 package org.apache.gobblin.dataset;
 
-import java.util.Map;
-
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Maps;
-
 import lombok.Getter;
 
+import javax.annotation.Nullable;
+import java.util.Map;
+import java.util.Objects;
+
 
 /**
  * A {@link Descriptor} identifies and provides metadata to describe a dataset
  */
 public class DatasetDescriptor extends Descriptor {

Review comment:
       I think inside Gobblin codebase, we work only with dataset instances. When we copy or ingest stuff, we always do it on a specific hdfs/kafka/other cluster, rather than on abstract multi-cluster dataset.  
   
   In practice, the DatasetDescriptor  is used for lineage reporting.
   
   It's not mandatory for backward-compatibility. In the OSS code base, there are a bunch of Salesforce/MySql connectors that don't report cluster names at the moment, although they can do it as well. There also could be usages in other libraries. 
   
   I think in the final state, cluster name should be mandatory in this class. For transitional period, we can get there gradually with marking old constructor obsolete and clustername optional.
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org