You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2020/06/12 00:37:06 UTC

[GitHub] [hudi] vinothchandar commented on a change in pull request #1678: [HUDI-242] Metadata Bootstrap changes

vinothchandar commented on a change in pull request #1678:
URL: https://github.com/apache/hudi/pull/1678#discussion_r439127629



##########
File path: .travis.yml
##########
@@ -13,14 +13,19 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
+os: linux
 language: java
 dist: trusty
 jdk:
   - openjdk8
-sudo: required
-env:
-  - HUDI_QUIETER_LOGGING=1 TEST_SUITE=unit
-  - TEST_SUITE=integration
+jobs:

Review comment:
       So `hudi-common` will be run for both? 

##########
File path: hudi-cli/src/main/java/org/apache/hudi/cli/commands/StatsCommand.java
##########
@@ -124,13 +125,15 @@ public String fileSizeStats(
     Histogram globalHistogram = new Histogram(new UniformReservoir(MAX_FILES));
     HashMap<String, Histogram> commitHistoMap = new HashMap<>();
     for (FileStatus fileStatus : statuses) {
-      String instantTime = FSUtils.getCommitTime(fileStatus.getPath().getName());
-      long sz = fileStatus.getLen();
-      if (!commitHistoMap.containsKey(instantTime)) {
-        commitHistoMap.put(instantTime, new Histogram(new UniformReservoir(MAX_FILES)));
+      if (!fileStatus.getPath().toString().contains(HoodieTableMetaClient.METAFOLDER_NAME)) {

Review comment:
       I wonder if such a listing should be moved inside a Utils methods in FSUtils and not guarded at the CLI level? 

##########
File path: docker/demo/sparksql-batch2.commands
##########
@@ -26,4 +26,14 @@ spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from s
 spark.sql("select symbol, max(ts) from stock_ticks_mor_rt group by symbol HAVING symbol = 'GOOG'").show(100, false)
 spark.sql("select `_hoodie_commit_time`, symbol, ts, volume, open, close  from stock_ticks_mor_rt where  symbol = 'GOOG'").show(100, false)
 
+ // Copy-On-Write Bootstrapped table

Review comment:
       wondering if we really need normal and bootstrapped tables are two cases here.. Is nt it not enough to just to cover at the partition level? i.e turn the one of the partitions in the the existing tables to be bootstrapped? that way we don't expand the existing test runtimes. 

##########
File path: hudi-cli/pom.xml
##########
@@ -147,6 +147,43 @@
           </execution>
         </executions>
       </plugin>
+<!--

Review comment:
       if its going to be commented out, remove? 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/FullBootstrapInputProvider.java
##########
@@ -0,0 +1,58 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap;
+
+import java.io.Serializable;
+import org.apache.hudi.avro.model.HoodieFileStatus;
+import org.apache.hudi.common.config.TypedProperties;
+import org.apache.hudi.common.model.HoodieRecord;
+import org.apache.hudi.common.util.collection.Pair;
+
+import org.apache.log4j.LogManager;
+import org.apache.log4j.Logger;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import java.util.List;
+
+/**
+ * Creates RDD of Hoodie Records given a list of partitions.
+ */
+public abstract class FullBootstrapInputProvider implements Serializable {

Review comment:
       make an interface for this?  that way we can even bootstrap data out of other table formats, by writing a simple class. 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/selector/BootstrapRegexModeSelector.java
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap.selector;
+
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+import java.util.stream.Collectors;
+import org.apache.hudi.avro.model.HoodieFileStatus;
+import org.apache.hudi.client.bootstrap.BootstrapMode;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+public class BootstrapRegexModeSelector extends BootstrapModeSelector {
+
+  private final Pattern pattern;
+  private final BootstrapMode bootstrapModeOnMatch;
+  private final BootstrapMode defaultMode;
+
+  public BootstrapRegexModeSelector(HoodieWriteConfig writeConfig) {
+    super(writeConfig);
+    this.pattern = Pattern.compile(writeConfig.getBootstrapModeSelectorRegex());
+    this.bootstrapModeOnMatch = writeConfig.getBootstrapModeForRegexMatch();
+    this.defaultMode = BootstrapMode.FULL_BOOTSTRAP.equals(bootstrapModeOnMatch)

Review comment:
       this part is bit confusing.... so we take a regex to match, and the mode (full/metadata).. we apply the configured mode to matching partitions.. and the other mode to the rest? is it better to just make another flag with sensible defaults for both? I can foresee having to explain this to users.. 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/selector/BootstrapModeSelector.java
##########
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap.selector;
+
+import org.apache.hudi.avro.model.HoodieFileStatus;
+import org.apache.hudi.client.bootstrap.BootstrapMode;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+import java.io.Serializable;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * Pluggable Partition Selector for selecting partitions to perform full or metadata-only bootstrapping.
+ */
+public abstract class BootstrapModeSelector implements Serializable {
+
+  protected final HoodieWriteConfig writeConfig;
+
+  public BootstrapModeSelector(HoodieWriteConfig writeConfig) {
+    this.writeConfig = writeConfig;
+  }
+
+  /**
+   * Classify partitions for the purpose of bootstrapping. For a non-partitioned source, input list will be one entry.
+   *
+   * @param partitions List of partitions with files present in each partitions
+   * @return a partitions grouped by bootstrap mode
+   */
+  public abstract Map<BootstrapMode, List<String>> select(
+      List<Pair<String, List<HoodieFileStatus>>> partitions);
+}

Review comment:
       nit: new line 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##########
@@ -147,6 +147,35 @@ public static SparkConf registerClasses(SparkConf conf) {
     return recordsWithLocation.filter(v1 -> !v1.isCurrentLocationKnown());
   }
 
+  /**
+   * Main API to run bootstrap to hudi.
+   */
+  public void bootstrap(Option<Map<String, String>> extraMetadata) {

Review comment:
       I would kind of expect the `bootStrapFromPath` to be passed in here? I guess that's part of the HoodieWriteConfig? 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##########
@@ -146,6 +146,35 @@ public static SparkConf registerClasses(SparkConf conf) {
     return recordsWithLocation.filter(v1 -> !v1.isCurrentLocationKnown());
   }
 
+  /**
+   * Main API to run bootstrap to hudi.
+   */
+  public void bootstrap(Option<Map<String, String>> extraMetadata) {
+    if (rollbackPending) {
+      rollBackPendingBootstrap();
+    }
+    HoodieTable<T> table = getTableAndInitCtx(WriteOperationType.UPSERT);
+    table.bootstrap(jsc, extraMetadata);
+  }
+
+  /**
+   * Main API to rollback pending bootstrap.
+   */
+  protected void rollBackPendingBootstrap() {

Review comment:
       On the naming, I feel we should stick to either inflight, requested or complete i.e rollbackInflightBootstrap or rollbackIncompleteRollback(). (pending is a new term, we are introducing?) . may be this ship has already sailed 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/BootstrapMode.java
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap;
+
+public enum BootstrapMode {

Review comment:
       javadocs

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/selector/BootstrapRegexModeSelector.java
##########
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap.selector;
+
+import java.util.List;
+import java.util.Map;
+import java.util.regex.Pattern;
+import java.util.stream.Collectors;
+import org.apache.hudi.avro.model.HoodieFileStatus;
+import org.apache.hudi.client.bootstrap.BootstrapMode;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+public class BootstrapRegexModeSelector extends BootstrapModeSelector {
+
+  private final Pattern pattern;
+  private final BootstrapMode bootstrapModeOnMatch;
+  private final BootstrapMode defaultMode;
+
+  public BootstrapRegexModeSelector(HoodieWriteConfig writeConfig) {
+    super(writeConfig);
+    this.pattern = Pattern.compile(writeConfig.getBootstrapModeSelectorRegex());
+    this.bootstrapModeOnMatch = writeConfig.getBootstrapModeForRegexMatch();
+    this.defaultMode = BootstrapMode.FULL_BOOTSTRAP.equals(bootstrapModeOnMatch)
+        ? BootstrapMode.METADATA_ONLY_BOOTSTRAP : BootstrapMode.FULL_BOOTSTRAP;
+    System.out.println("Default Mode :" + defaultMode + ", on Match Mode :" + bootstrapModeOnMatch);

Review comment:
       logger? 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/HoodieWriteClient.java
##########
@@ -146,6 +146,35 @@ public static SparkConf registerClasses(SparkConf conf) {
     return recordsWithLocation.filter(v1 -> !v1.isCurrentLocationKnown());
   }
 
+  /**
+   * Main API to run bootstrap to hudi.
+   */
+  public void bootstrap(Option<Map<String, String>> extraMetadata) {
+    if (rollbackPending) {
+      rollBackPendingBootstrap();
+    }
+    HoodieTable<T> table = getTableAndInitCtx(WriteOperationType.UPSERT);
+    table.bootstrap(jsc, extraMetadata);
+  }
+
+  /**
+   * Main API to rollback pending bootstrap.
+   */
+  protected void rollBackPendingBootstrap() {

Review comment:
       for testing it still works if we drop the `protected` correct? keeping it package private

##########
File path: hudi-cli/pom.xml
##########
@@ -202,6 +237,12 @@
       <type>test-jar</type>
     </dependency>
 
+    <dependency>

Review comment:
       should we remove other dependencies from this pom then?  hudi-common, hudi-client etc will be inside the utilities bundle anyway 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/BootstrapMode.java
##########
@@ -0,0 +1,24 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap;
+
+public enum BootstrapMode {
+  FULL_BOOTSTRAP,
+  METADATA_ONLY_BOOTSTRAP

Review comment:
       rename to 
   `RECORD_DATA_BOOTSTRAP`, `RECORD_METADATA_BOOTSTRAP` (or)
   just `THICK_BOOTSTRAP`, `THIN_BOOTSTRAP` that more user understandable and relatable to eachother? 
   We atleast need to call it `RECORD_METADATA_ONLY_BOOTSTRAP` since now, it feels like we are just copying file metadata.. which we are not.. 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/BootstrapSourceSchemaProvider.java
##########
@@ -0,0 +1,77 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap;
+
+import org.apache.hudi.avro.model.HoodieFileStatus;
+import org.apache.hudi.common.bootstrap.FileStatusUtils;
+import org.apache.hudi.common.util.ParquetUtils;
+import org.apache.hudi.common.util.collection.Pair;
+import org.apache.hudi.config.HoodieWriteConfig;
+import org.apache.spark.api.java.JavaSparkContext;
+
+import org.apache.avro.Schema;
+import org.apache.hadoop.fs.Path;
+
+import java.util.List;
+
+/**
+ * Bootstrap Schema Provider. Schema provided in config is used. If not available, use schema from Parquet
+ */
+public class BootstrapSourceSchemaProvider {
+
+  protected final HoodieWriteConfig bootstrapConfig;

Review comment:
       rename to writeConfig, unless this only has bootstrap related properties 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/translator/MetadataBootstrapPartitionPathTranslator.java
##########
@@ -16,28 +16,23 @@
  * limitations under the License.
  */
 
-package org.apache.hudi.keygen;
-
-import org.apache.hudi.common.config.TypedProperties;
-import org.apache.hudi.common.model.HoodieKey;
-
-import org.apache.avro.generic.GenericRecord;
+package org.apache.hudi.client.bootstrap.translator;

Review comment:
       why is linkedin against the keygenerator file? 

##########
File path: hudi-client/src/main/java/org/apache/hudi/client/bootstrap/selector/MetadataOnlyBootstrapModeSelector.java
##########
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.hudi.client.bootstrap.selector;
+
+import org.apache.hudi.client.bootstrap.BootstrapMode;
+import org.apache.hudi.config.HoodieWriteConfig;
+
+public class MetadataOnlyBootstrapModeSelector extends UniBootstrapModeSelector {

Review comment:
       we need to rename these classes nicely as well 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org