Posted to commits@hudi.apache.org by xu...@apache.org on 2022/04/26 14:54:36 UTC

[hudi] branch asf-site updated: [HUDI-3927][DOCS] Add datahub and glue sync to metasync section on website (#5438)

This is an automated email from the ASF dual-hosted git repository.

xushiyan pushed a commit to branch asf-site
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/asf-site by this push:
     new 6803eef2fc [HUDI-3927][DOCS] Add datahub and glue sync to metasync section on website (#5438)
6803eef2fc is described below

commit 6803eef2fce1b06c9e03cfd816ad3f854ec9295c
Author: Raymond Xu <27...@users.noreply.github.com>
AuthorDate: Tue Apr 26 07:54:28 2022 -0700

    [HUDI-3927][DOCS] Add datahub and glue sync to metasync section on website (#5438)
---
 website/docs/syncing_aws_glue_data_catalog.md | 18 ++++++++++
 website/docs/syncing_datahub.md               | 49 +++++++++++++++++++++++++++
 website/docs/syncing_metastore.md             |  2 +-
 website/sidebars.js                           | 10 +++++-
 4 files changed, 77 insertions(+), 2 deletions(-)

diff --git a/website/docs/syncing_aws_glue_data_catalog.md b/website/docs/syncing_aws_glue_data_catalog.md
new file mode 100644
index 0000000000..6cb724b551
--- /dev/null
+++ b/website/docs/syncing_aws_glue_data_catalog.md
@@ -0,0 +1,18 @@
+---
+title: Sync to AWS Glue Data Catalog
+keywords: [hudi, aws, glue, sync]
+---
+
+Hudi tables can sync to AWS Glue Data Catalog directly via the AWS SDK. Piggybacking on `HiveSyncTool`,
+`org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool` makes use of all the configurations that are taken by `HiveSyncTool`
+and sends them to AWS Glue.
+
+### Configurations
+
+There is no additional configuration for using `AwsGlueCatalogSyncTool`; you just need to set it as one of the sync tool
+classes for `HoodieDeltaStreamer` and everything configured as shown in [Sync to Hive Metastore](syncing_metastore) will
+be passed along.
+
+```shell
+--sync-tool-classes org.apache.hudi.aws.sync.AwsGlueCatalogSyncTool
+```
diff --git a/website/docs/syncing_datahub.md b/website/docs/syncing_datahub.md
new file mode 100644
index 0000000000..a294f339f3
--- /dev/null
+++ b/website/docs/syncing_datahub.md
@@ -0,0 +1,49 @@
+---
+title: Sync to DataHub
+keywords: [hudi, datahub, sync]
+---
+
+[DataHub](https://datahubproject.io/) is a rich metadata platform that supports features like data discovery, data
+observability, federated governance, etc.
+
+Starting with Hudi 0.11.0, you can sync to a DataHub instance by setting `DataHubSyncTool` as one of the sync tool
+classes for `HoodieDeltaStreamer`.
+
+The target Hudi table will be synced to DataHub as a `Dataset`. The Hudi table's Avro schema will be synced, along
+with the commit timestamp when running the sync.
+
+### Configurations
+
+`DataHubSyncTool` makes use of DataHub's Java Emitter to send the metadata via HTTP REST APIs. It is required to
+set `hoodie.meta.sync.datahub.emitter.server` to the URL of the DataHub instance for sync.
+
+If the DataHub instance requires an auth token, set `hoodie.meta.sync.datahub.emitter.token`.
+
+If you need to customize how the emitter object is created,
+implement `org.apache.hudi.sync.datahub.config.DataHubEmitterSupplier` and supply the implementation's FQCN
+to `hoodie.meta.sync.datahub.emitter.supplier.class`.
+
+By default, the sync config's database name and table name will be used to make the target `Dataset`'s URN.
+To customize the URN creation, subclass `HoodieDataHubDatasetIdentifier` and set
+`hoodie.meta.sync.datahub.dataset.identifier.class` to the subclass.
+
+### Example
+
+The following shows an example configuration to run `HoodieDeltaStreamer` with `DataHubSyncTool`.
+
+In addition to `hudi-utilities-bundle` that contains `HoodieDeltaStreamer`, you also need to add
+`hudi-datahub-sync-bundle` to the classpath.
+
+```shell
+# Other HoodieDeltaStreamer configs are omitted here; add them alongside the options below.
+spark-submit --master yarn \
+--jars /opt/hudi-datahub-sync-bundle-0.11.0.jar \
+--class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
+/opt/hudi-utilities-bundle_2.12-0.11.0.jar \
+--target-table mytable \
+--enable-sync \
+--sync-tool-classes org.apache.hudi.sync.datahub.DataHubSyncTool \
+--hoodie-conf hoodie.meta.sync.datahub.emitter.server=http://url-to-datahub-instance:8080 \
+--hoodie-conf hoodie.datasource.hive_sync.database=mydb \
+--hoodie-conf hoodie.datasource.hive_sync.table=mytable
+```
diff --git a/website/docs/syncing_metastore.md b/website/docs/syncing_metastore.md
index 3e9cf4b616..f1c1fdc582 100644
--- a/website/docs/syncing_metastore.md
+++ b/website/docs/syncing_metastore.md
@@ -1,5 +1,5 @@
 ---
-title: Syncing to Metastore
+title: Sync to Hive Metastore
 keywords: [hudi, hive, sync]
 ---
 
diff --git a/website/sidebars.js b/website/sidebars.js
index 5589aece3f..a7be23c8b0 100644
--- a/website/sidebars.js
+++ b/website/sidebars.js
@@ -49,7 +49,15 @@ module.exports = {
                 'hoodie_deltastreamer',
                 'querying_data',
                 'flink_configuration',
-                'syncing_metastore',
+                {
+                    type: 'category',
+                    label: 'Sync to Metastore',
+                    items: [
+                        'syncing_aws_glue_data_catalog',
+                        'syncing_datahub',
+                        'syncing_metastore'
+                    ],
+                }
             ],
         },
         {