Posted to commits@hudi.apache.org by yi...@apache.org on 2022/08/30 05:00:40 UTC

[hudi] branch asf-site updated: [DOCS] Update migration_guide.md (#6275)

This is an automated email from the ASF dual-hosted git repository.

yihua pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 4fc0d427a0 [DOCS] Update migration_guide.md (#6275)
4fc0d427a0 is described below

commit 4fc0d427a00cd650057c0458e3a596dfb1d58e9d
Author: Manu <36...@users.noreply.github.com>
AuthorDate: Tue Aug 30 13:00:31 2022 +0800

    [DOCS] Update migration_guide.md (#6275)
    
    Co-authored-by: Y Ethan Guo <et...@gmail.com>
---
 website/docs/migration_guide.md                    | 42 +++++++++++++---------
 .../version-0.11.1/migration_guide.md              | 42 +++++++++++++---------
 .../version-0.12.0/migration_guide.md              | 42 +++++++++++++---------
 3 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/website/docs/migration_guide.md b/website/docs/migration_guide.md
index e7dd5c29d7..449d65c376 100644
--- a/website/docs/migration_guide.md
+++ b/website/docs/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
 There are a few options when choosing this approach.
 
 **Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY generates just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD performs a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping Hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+``` 
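+
+For comparison, a METADATA_ONLY run would reuse the same command with only the mode selector configuration changed (a sketch derived from the command above; verify the bootstrap configs against your Hudi version):
+```
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=METADATA_ONLY \
+```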
 
 **Option 2**
 For huge tables, this could be as simple as : 
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
 
 **Option 3**
 Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+[here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 
 ```java
-hudi->hdfsparquetimport
-        --upsert false
-        --srcPath /user/parquet/table/basepath
-        --targetPath /user/hoodie/table/basepath
-        --tableName hoodie_table
-        --tableType COPY_ON_WRITE
-        --rowKeyField _row_key
-        --partitionPathField partitionStr
-        --parallelism 1500
-        --schemaFilePath /user/table/schema
-        --format parquet
-        --sparkMemory 6g
-        --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details via `help "bootstrap run"`.
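+
+For example, a METADATA_ONLY bootstrap from the CLI would swap in the metadata-only selector (a sketch assuming the same placeholder fields as above and that org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector is available in your build):
+```
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --selectorClass org.apache.hudi.client.bootstrap.selector.MetadataOnlyBootstrapModeSelector
+```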
diff --git a/website/versioned_docs/version-0.11.1/migration_guide.md b/website/versioned_docs/version-0.11.1/migration_guide.md
index e7dd5c29d7..7f5ccf2d9c 100644
--- a/website/versioned_docs/version-0.11.1/migration_guide.md
+++ b/website/versioned_docs/version-0.11.1/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
 There are a few options when choosing this approach.
 
 **Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY generates just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD performs a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping Hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+``` 
 
 **Option 2**
 For huge tables, this could be as simple as : 
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
 
 **Option 3**
 Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+ [here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 
 ```java
-hudi->hdfsparquetimport
-        --upsert false
-        --srcPath /user/parquet/table/basepath
-        --targetPath /user/hoodie/table/basepath
-        --tableName hoodie_table
-        --tableType COPY_ON_WRITE
-        --rowKeyField _row_key
-        --partitionPathField partitionStr
-        --parallelism 1500
-        --schemaFilePath /user/table/schema
-        --format parquet
-        --sparkMemory 6g
-        --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details via `help "bootstrap run"`.
diff --git a/website/versioned_docs/version-0.12.0/migration_guide.md b/website/versioned_docs/version-0.12.0/migration_guide.md
index e7dd5c29d7..fa5b663f56 100644
--- a/website/versioned_docs/version-0.12.0/migration_guide.md
+++ b/website/versioned_docs/version-0.12.0/migration_guide.md
@@ -36,8 +36,29 @@ Import your existing table into a Hudi managed table. Since all the data is Hudi
 There are a few options when choosing this approach.
 
 **Option 1**
-Use the HDFSParquetImporter tool. As the name suggests, this only works if your existing table is in parquet file format.
-This tool essentially starts a Spark Job to read the existing parquet table and converts it into a HUDI managed table by re-writing all the data.
+Use the HoodieDeltaStreamer tool. HoodieDeltaStreamer supports bootstrap with the --run-bootstrap command line option. There are two types of bootstrap,
+METADATA_ONLY and FULL_RECORD. METADATA_ONLY generates just skeleton base files with keys/footers, avoiding the full cost of rewriting the dataset.
+FULL_RECORD performs a full copy/rewrite of the data as a Hudi table.
+
+Here is an example of running a FULL_RECORD bootstrap with HoodieDeltaStreamer while keeping Hive-style partitioning.
+```
+spark-submit --master local \
+--conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer `ls packaging/hudi-utilities-bundle/target/hudi-utilities-bundle-*.jar` \
+--run-bootstrap \
+--target-base-path /tmp/hoodie/bootstrap_table \
+--target-table bootstrap_table \
+--table-type COPY_ON_WRITE \
+--hoodie-conf hoodie.bootstrap.base.path=/tmp/source_table \
+--hoodie-conf hoodie.datasource.write.recordkey.field=${KEY_FIELD} \
+--hoodie-conf hoodie.datasource.write.partitionpath.field=${PARTITION_FIELD} \
+--hoodie-conf hoodie.datasource.write.precombine.field=${PRECOMBINE_FIELD} \
+--hoodie-conf hoodie.bootstrap.keygen.class=org.apache.hudi.keygen.SimpleKeyGenerator \
+--hoodie-conf hoodie.bootstrap.full.input.provider=org.apache.hudi.bootstrap.SparkParquetBootstrapDataProvider \
+--hoodie-conf hoodie.bootstrap.mode.selector=org.apache.hudi.client.bootstrap.selector.BootstrapRegexModeSelector \
+--hoodie-conf hoodie.bootstrap.mode.selector.regex.mode=FULL_RECORD \
+--hoodie-conf hoodie.datasource.write.hive_style_partitioning=true
+``` 
 
 **Option 2**
 For huge tables, this could be as simple as : 
@@ -50,21 +71,10 @@ for partition in [list of partitions in source table] {
 
 **Option 3**
 Write your own custom logic of how to load an existing table into a Hudi managed one. Please read about the RDD API
- [here](/docs/quick-start-guide). Using the HDFSParquetImporter Tool. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
+[here](/docs/quick-start-guide). Using the bootstrap run CLI. Once hudi has been built via `mvn clean install -DskipTests`, the shell can be
 fired by via `cd hudi-cli && ./hudi-cli.sh`.
 
 ```java
-hudi->hdfsparquetimport
-        --upsert false
-        --srcPath /user/parquet/table/basepath
-        --targetPath /user/hoodie/table/basepath
-        --tableName hoodie_table
-        --tableType COPY_ON_WRITE
-        --rowKeyField _row_key
-        --partitionPathField partitionStr
-        --parallelism 1500
-        --schemaFilePath /user/table/schema
-        --format parquet
-        --sparkMemory 6g
-        --retry 2
+hudi->bootstrap run --srcPath /tmp/source_table --targetPath /tmp/hoodie/bootstrap_table --tableName bootstrap_table --tableType COPY_ON_WRITE --rowKeyField ${KEY_FIELD} --partitionPathField ${PARTITION_FIELD} --sparkMaster local --hoodieConfigs hoodie.datasource.write.hive_style_partitioning=true --selectorClass org.apache.hudi.client.bootstrap.selector.FullRecordBootstrapModeSelector
 ```
+Unlike HoodieDeltaStreamer, FULL_RECORD or METADATA_ONLY is set with --selectorClass; see details via `help "bootstrap run"`.
\ No newline at end of file