Posted to commits@hudi.apache.org by vi...@apache.org on 2022/04/05 15:58:20 UTC

[hudi] branch master updated: [HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)

This is an automated email from the ASF dual-hosted git repository.

vinoth pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/hudi.git


The following commit(s) were added to refs/heads/master by this push:
     new 92ca426ab7 [HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)
92ca426ab7 is described below

commit 92ca426ab7ef54c8ce928f770783f4f83510d8ae
Author: Vinoth Govindarajan <vi...@uber.com>
AuthorDate: Tue Apr 5 08:58:13 2022 -0700

    [HUDI-2319] dbt example models to demonstrate hudi dbt integration (#5220)
    
    * dbt example models to demonstrate hudi dbt integration
    
    * Fixed readme text
---
 hudi-examples/hudi-examples-dbt/.gitignore         |   8 ++
 hudi-examples/hudi-examples-dbt/README.md          | 118 +++++++++++++++++++++
 hudi-examples/hudi-examples-dbt/dbt_project.yml    |  49 +++++++++
 .../models/example/hudi_insert_overwrite_table.sql |  37 +++++++
 .../models/example/hudi_insert_table.sql           |  42 ++++++++
 .../example/hudi_upsert_partitioned_cow_table.sql  |  40 +++++++
 .../example/hudi_upsert_partitioned_mor_table.sql  |  40 +++++++
 .../models/example/hudi_upsert_table.sql           |  38 +++++++
 .../hudi-examples-dbt/models/example/schema.yml    | 104 ++++++++++++++++++
 9 files changed, 476 insertions(+)

diff --git a/hudi-examples/hudi-examples-dbt/.gitignore b/hudi-examples/hudi-examples-dbt/.gitignore
new file mode 100644
index 0000000000..0eb3fd035d
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/.gitignore
@@ -0,0 +1,8 @@
+target/
+dbt_modules/
+logs/
+.tox/
+.idea/
+.DS_Store
+.vscode
+*.log
diff --git a/hudi-examples/hudi-examples-dbt/README.md b/hudi-examples/hudi-examples-dbt/README.md
new file mode 100644
index 0000000000..52a6ea3ac8
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/README.md
@@ -0,0 +1,118 @@
+## Example dbt project: `hudi_examples_dbt`
+
+This dbt project demonstrates Hudi integration with dbt. It contains a few models that show the different ways in which you can create Hudi datasets using dbt.
+
+### What is this repo?
+What this repo _is_:
+- A self-contained playground dbt project, useful for testing out scripts and communicating some of the core dbt concepts.
+
+### Running this project
+To get up and running with this project:
+1. Install dbt using [these instructions](https://docs.getdbt.com/docs/installation).
+
+2. Install [dbt-spark](https://github.com/dbt-labs/dbt-spark) package:
+```bash
+pip install dbt-spark
+```
+
+3. Clone this repo and change into the `hudi-examples-dbt` directory from the command line:
+```bash
+cd hudi-examples/hudi-examples-dbt
+```
+
+4. Set up a profile called `spark` to connect to a spark cluster by following [these instructions](https://docs.getdbt.com/reference/warehouse-profiles/spark-profile). If you have access to a data warehouse, you can use those credentials – we recommend setting your [target schema](https://docs.getdbt.com/docs/configure-your-profile#section-populating-your-profile) to be a new schema (dbt will create the schema for you, as long as you have the right privileges). If you don't have access t [...]
+
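+A minimal `profiles.yml` sketch for connecting dbt to a Spark Thrift Server is shown below; the host, port and schema are assumptions for a local setup (the `local` target and `analytics` schema match the sample output further down), so adjust them for your cluster:
+
+```yaml
+spark:
+  target: local
+  outputs:
+    local:
+      type: spark
+      method: thrift     # connect through the Spark Thrift Server
+      host: localhost    # assumed: Thrift Server host
+      port: 10000        # assumed: default Thrift Server port
+      schema: analytics  # dbt creates this schema if it does not exist
+      threads: 1
+```
+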
+> **NOTE:** You need to add the Hudi Spark bundle to the Spark cluster; the latest supported version is 0.10.1.
+
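+One way to make the bundle available, assuming a Spark 3.1.x / Scala 2.12 cluster started through the Thrift Server, is to pass it via `--packages`; the artifact coordinates below are an example and depend on your Spark and Scala versions:
+
+```bash
+# Example only: pick the Hudi bundle matching your Spark/Scala versions
+$SPARK_HOME/sbin/start-thriftserver.sh \
+  --packages org.apache.hudi:hudi-spark3.1.2-bundle_2.12:0.10.1 \
+  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
+  --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
+```
+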
+5. Ensure your profile is set up correctly from the command line:
+```bash
+dbt debug
+```
+
+The output of the above command should end with:
+```bash
+All checks passed!
+```
+
+6. Run the models:
+```bash
+dbt run
+```
+
+Output should look like this:
+```bash
+05:47:28  Running with dbt=1.0.0
+05:47:28  Found 5 models, 10 tests, 0 snapshots, 0 analyses, 0 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
+05:47:28
+05:47:29  Concurrency: 1 threads (target='local')
+05:47:29
+05:47:29  1 of 5 START incremental model analytics.hudi_insert_table...................... [RUN]
+05:47:31  1 of 5 OK created incremental model analytics.hudi_insert_table................. [OK in 2.61s]
+05:47:31  2 of 5 START incremental model analytics.hudi_insert_overwrite_table............ [RUN]
+05:47:34  2 of 5 OK created incremental model analytics.hudi_insert_overwrite_table....... [OK in 3.19s]
+05:47:34  3 of 5 START incremental model analytics.hudi_upsert_table...................... [RUN]
+05:47:37  3 of 5 OK created incremental model analytics.hudi_upsert_table................. [OK in 2.68s]
+05:47:37  4 of 5 START incremental model analytics.hudi_upsert_partitioned_cow_table...... [RUN]
+05:47:40  4 of 5 OK created incremental model analytics.hudi_upsert_partitioned_cow_table. [OK in 2.60s]
+05:47:40  5 of 5 START incremental model analytics.hudi_upsert_partitioned_mor_table...... [RUN]
+05:47:42  5 of 5 OK created incremental model analytics.hudi_upsert_partitioned_mor_table. [OK in 2.53s]
+05:47:42
+05:47:42  Finished running 5 incremental models in 14.70s.
+05:47:42
+05:47:42  Completed successfully
+```
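+
+The models build on one another through `ref()` (for example, `hudi_insert_overwrite_table` selects from `hudi_insert_table`), so `dbt run` materializes them in dependency order. To rebuild just one model together with its upstream dependencies, a selector like the following can be used:
+```bash
+dbt run --select +hudi_upsert_table
+```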
+7. Test the output of the models:
+```bash
+dbt test
+```
+Output should look like this:
+```bash
+05:48:17  Running with dbt=1.0.0
+05:48:17  Found 5 models, 10 tests, 0 snapshots, 0 analyses, 0 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics
+05:48:17
+05:48:19  Concurrency: 1 threads (target='local')
+05:48:19
+05:48:19  1 of 10 START test not_null_hudi_insert_overwrite_table_id...................... [RUN]
+05:48:19  1 of 10 PASS not_null_hudi_insert_overwrite_table_id............................ [PASS in 0.50s]
+05:48:19  2 of 10 START test not_null_hudi_insert_overwrite_table_name.................... [RUN]
+05:48:20  2 of 10 PASS not_null_hudi_insert_overwrite_table_name.......................... [PASS in 0.45s]
+05:48:20  3 of 10 START test not_null_hudi_insert_overwrite_table_ts...................... [RUN]
+05:48:20  3 of 10 PASS not_null_hudi_insert_overwrite_table_ts............................ [PASS in 0.47s]
+05:48:20  4 of 10 START test not_null_hudi_insert_table_id................................ [RUN]
+05:48:20  4 of 10 PASS not_null_hudi_insert_table_id...................................... [PASS in 0.44s]
+05:48:20  5 of 10 START test not_null_hudi_upsert_table_id................................ [RUN]
+05:48:21  5 of 10 PASS not_null_hudi_upsert_table_id...................................... [PASS in 0.38s]
+05:48:21  6 of 10 START test not_null_hudi_upsert_table_name.............................. [RUN]
+05:48:21  6 of 10 PASS not_null_hudi_upsert_table_name.................................... [PASS in 0.40s]
+05:48:21  7 of 10 START test not_null_hudi_upsert_table_ts................................ [RUN]
+05:48:22  7 of 10 PASS not_null_hudi_upsert_table_ts...................................... [PASS in 0.38s]
+05:48:22  8 of 10 START test unique_hudi_insert_overwrite_table_id........................ [RUN]
+05:48:23  8 of 10 PASS unique_hudi_insert_overwrite_table_id.............................. [PASS in 1.32s]
+05:48:23  9 of 10 START test unique_hudi_insert_table_id.................................. [RUN]
+05:48:24  9 of 10 PASS unique_hudi_insert_table_id........................................ [PASS in 1.26s]
+05:48:24  10 of 10 START test unique_hudi_upsert_table_id................................. [RUN]
+05:48:25  10 of 10 PASS unique_hudi_upsert_table_id....................................... [PASS in 1.29s]
+05:48:26
+05:48:26  Finished running 10 tests in 8.23s.
+05:48:26
+05:48:26  Completed successfully
+05:48:26
+05:48:26  Done. PASS=10 WARN=0 ERROR=0 SKIP=0 TOTAL=10
+```
+
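+To run only the tests attached to a single model, the same selector syntax works:
+```bash
+dbt test --select hudi_upsert_table
+```
+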
+8. Generate documentation for the project:
+```bash
+dbt docs generate
+```
+
+9. View the [documentation](http://127.0.0.1:8080/#!/overview) for the project after running the following command:
+```bash
+dbt docs serve
+```
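+
+`dbt docs serve` binds to port 8080 by default; if that port is taken, pass `--port` and adjust the URL above accordingly:
+```bash
+dbt docs serve --port 8081
+```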
+
+---
+For more information on dbt:
+- Read the [introduction to dbt](https://docs.getdbt.com/docs/introduction).
+- Read the [dbt viewpoint](https://docs.getdbt.com/docs/about/viewpoint).
+- Join the [dbt community](http://community.getdbt.com/).
+---
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/dbt_project.yml b/hudi-examples/hudi-examples-dbt/dbt_project.yml
new file mode 100644
index 0000000000..dc5f5593d6
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/dbt_project.yml
@@ -0,0 +1,49 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+# Name your project! Project names should contain only lowercase characters
+# and underscores. A good package name should reflect your organization's
+# name or the intended use of these models
+name: 'hudi_examples_dbt'
+version: '1.0.0'
+config-version: 2
+
+# This setting configures which "profile" dbt uses for this project.
+profile: 'spark'
+
+# These configurations specify where dbt should look for different types of files.
+# The `model-paths` config, for example, states that models in this project can be
+# found in the "models/" directory. You probably won't need to change these!
+model-paths: ["models"]
+
+target-path: "target"  # directory which will store compiled SQL files
+clean-targets:         # directories to be removed by `dbt clean`
+    - "target"
+    - "dbt_modules"
+
+# Configuring models
+# Full documentation: https://docs.getdbt.com/docs/configuring-models
+
+# In this example config, we tell dbt to build all models in the example/ directory
+# as tables. These settings can be overridden in the individual model files
+# using the `{{ config(...) }}` macro.
+models:
+  +file_format: hudi
+  hudi_examples_dbt:
+      # Applies to all files under models/example/
+      example:
+          materialized: table
diff --git a/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_overwrite_table.sql b/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_overwrite_table.sql
new file mode 100644
index 0000000000..e0afa5a456
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_overwrite_table.sql
@@ -0,0 +1,37 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+
+/*
+ Example of an insert_overwrite for a non-partitioned table with incremental materialization.
+ */
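+-- 'insert_overwrite' replaces the existing contents of this non-partitioned table on each
+-- run; 'type' = 'cow' creates a copy-on-write Hudi table, and 'precombineKey' = 'ts' picks
+-- the latest record when several incoming rows share the same key ('id').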
+{{ config(
+    materialized='incremental',
+    file_format='hudi',
+    incremental_strategy='insert_overwrite',
+    options={
+        'type': 'cow',
+        'precombineKey': 'ts',
+    },
+    unique_key='id'
+   )
+}}
+
+select id, cast(rand() as string) as name, current_timestamp() as ts
+from {{ ref('hudi_insert_table') }}
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_table.sql b/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_table.sql
new file mode 100644
index 0000000000..a77bf796ca
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/hudi_insert_table.sql
@@ -0,0 +1,42 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+
+/*
+ Example of an insert for a non-partitioned table with incremental materialization.
+ */
+{{
+    config(
+        materialized='incremental',
+        file_format='hudi',
+        unique_key='id'
+    )
+}}
+
+with source_data as (
+
+    select format_number(rand()*1000, 0) as id
+    union all
+    select null as id
+
+    )
+
+select *
+from source_data
+where id is not null
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_cow_table.sql b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_cow_table.sql
new file mode 100644
index 0000000000..caedcbc5fd
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_cow_table.sql
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+
+/*
+ Example of an upsert for a partitioned copy on write table with incremental materialization using merge strategy.
+ */
+{{ config(
+    materialized='incremental',
+    file_format='hudi',
+    incremental_strategy='merge',
+    options={
+        'type': 'cow',
+        'primaryKey': 'id',
+        'precombineKey': 'ts',
+    },
+    unique_key='id',
+    partition_by='datestr',
+    pre_hook=["set spark.sql.datetime.java8API.enabled=false;"],
+   )
+}}
+
+select id, name, current_timestamp() as ts, current_date as datestr
+from {{ ref('hudi_upsert_table') }}
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_mor_table.sql b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_mor_table.sql
new file mode 100644
index 0000000000..2beab7c4ae
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_partitioned_mor_table.sql
@@ -0,0 +1,40 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+
+/*
+ Example of an upsert for a partitioned merge on read table with incremental materialization using merge strategy.
+ */
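+-- 'type' = 'mor' creates a merge-on-read Hudi table (updates land in log files that are
+-- compacted later), and 'partition_by' partitions the table on the 'datestr' column. The
+-- pre_hook disables Spark's Java 8 date/time API (assumption: this avoids a serialization
+-- issue when writing the current_date partition value with Hudi 0.10.x).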
+{{ config(
+    materialized='incremental',
+    file_format='hudi',
+    incremental_strategy='merge',
+    options={
+        'type': 'mor',
+        'primaryKey': 'id',
+        'precombineKey': 'ts',
+    },
+    unique_key='id',
+    partition_by='datestr',
+    pre_hook=["set spark.sql.datetime.java8API.enabled=false;"],
+   )
+}}
+
+select id, name, current_timestamp() as ts, current_date as datestr
+from {{ ref('hudi_upsert_table') }}
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_table.sql b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_table.sql
new file mode 100644
index 0000000000..b8ee5b3ed4
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/hudi_upsert_table.sql
@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ *
+ */
+
+/*
+ Example of an upsert for a non-partitioned table with incremental materialization using merge strategy.
+ */
+{{ config(
+    materialized='incremental',
+    file_format='hudi',
+    incremental_strategy='merge',
+    options={
+        'type': 'cow',
+        'primaryKey': 'id',
+        'precombineKey': 'ts',
+    },
+    unique_key='id'
+   )
+}}
+
+select id, name, current_timestamp() as ts
+from {{ ref('hudi_insert_overwrite_table') }}
\ No newline at end of file
diff --git a/hudi-examples/hudi-examples-dbt/models/example/schema.yml b/hudi-examples/hudi-examples-dbt/models/example/schema.yml
new file mode 100644
index 0000000000..64ae9099bd
--- /dev/null
+++ b/hudi-examples/hudi-examples-dbt/models/example/schema.yml
@@ -0,0 +1,104 @@
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
+
+version: 2
+
+models:
+    - name: hudi_insert_table
+      description: "Hudi insert non-partitioned table with incremental materialization"
+      columns:
+          - name: id
+            description: "The primary key for this table"
+            tests:
+                - unique
+                - not_null
+
+    - name: hudi_insert_overwrite_table
+      description: "Hudi insert overwrite non-partitioned table with incremental materialization"
+      columns:
+        - name: id
+          description: "The primary key for this table"
+          tests:
+            - unique
+            - not_null
+        - name: name
+          description: "Employee name"
+          tests:
+            - not_null
+        - name: ts
+          description: "Created timestamp"
+          tests:
+            - not_null
+
+    - name: hudi_upsert_table
+      description: "Hudi upsert non-partitioned table with incremental materialization"
+      columns:
+        - name: id
+          description: "The primary key for this table"
+          tests:
+            - unique
+            - not_null
+        - name: name
+          description: "Employee name"
+          tests:
+            - not_null
+        - name: ts
+          description: "Created timestamp"
+          tests:
+            - not_null
+
+    - name: hudi_upsert_partitioned_cow_table
+      description: "Hudi upsert partitioned copy-on-write table with incremental materialization using merge strategy"
+      columns:
+        - name: id
+          description: "The primary key for this table"
+          tests:
+            - unique
+            - not_null
+        - name: name
+          description: "Employee name"
+          tests:
+            - not_null
+        - name: ts
+          description: "Created timestamp"
+          tests:
+            - not_null
+        - name: datestr
+          description: "Partition date string column"
+          tests:
+            - not_null
+
+    - name: hudi_upsert_partitioned_mor_table
+      description: "Hudi upsert partitioned merge-on-read table with incremental materialization using merge strategy"
+      columns:
+        - name: id
+          description: "The primary key for this table"
+          tests:
+            - unique
+            - not_null
+        - name: name
+          description: "Employee name"
+          tests:
+            - not_null
+        - name: ts
+          description: "Created timestamp"
+          tests:
+            - not_null
+        - name: datestr
+          description: "Partition date string column"
+          tests:
+            - not_null