Posted to commits@hudi.apache.org by GitBox <gi...@apache.org> on 2022/04/02 20:15:20 UTC

[GitHub] [hudi] xushiyan commented on pull request #5125: [HUDI-3357] MVP implementation of BigQuerySyncTool

xushiyan commented on pull request #5125:
URL: https://github.com/apache/hudi/pull/5125#issuecomment-1086715332


   ## Test setup
   
   - Launch Dataproc 2.0.34-ubuntu18
   - From the Dataproc instance, launch spark-shell
   ```shell
   spark-shell \
     --jars gs://xxx/hudi-spark3.1-bundle_2.12-0.11.0-SNAPSHOT.jar \
     --packages org.apache.spark:spark-avro_2.12:3.1.2 \
     --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
     --conf 'spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension'
   ```
   - Prepare a partitioned table on GS
   ```scala
   import org.apache.hudi.QuickstartUtils._
   import scala.collection.JavaConversions._
   import org.apache.spark.sql.SaveMode._
   import org.apache.hudi.DataSourceReadOptions._
   import org.apache.hudi.DataSourceWriteOptions._
   import org.apache.hudi.config.HoodieWriteConfig._
   
   val tableName = "hudi_cow_pt_tbl"
   
   spark.sql(
     s"""
       |create table $tableName (
       |  id bigint,
       |  name string,
       |  ts bigint,
       |  dt string
       |) using hudi
       |tblproperties (
       |  type = 'cow',
       |  primaryKey = 'id',
       |  preCombineField = 'ts',
       |  hoodie.datasource.write.hive_style_partitioning = 'true',
       |  hoodie.datasource.write.drop.partition.columns = 'true',
       |  hoodie.metadata.enable = 'false'
       | )
       |partitioned by (dt)
       |location 'gs://foo/bar';
       |""".stripMargin)
   spark.sql(
     s"""
       |insert into $tableName partition (dt)
       |select 1 as id, 'a1' as name, 1000 as ts, '2021-12-09' as dt;
       |""".stripMargin)
   ```
   - Build the bundle jars and the assembly jar from the `hudi-gcp-bundle` module
   ```shell
   mvn -T 2.5C clean install -DskipTests -Dcheckstyle.skip=true -Drat.skip=true -Djacoco.skip=true -Dmaven.javadoc.skip=true -Dscala12 -Dspark3.1
   mvn assembly:single package -pl packaging/hudi-gcp-bundle
   ```
   - Put the bundle jars and the GCP bundle fat jar in the GS bucket
   ```
   gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar 
   ```
   - Go to BigQuery and create a dataset `mydataset` (set its location to match the GS bucket's)
   - From the Dataproc server, submit the sync tool job
   ```shell
   spark-submit --master yarn \
   --packages org.apache.spark:spark-avro_2.12:3.1.2 \
   --class org.apache.hudi.gcp.bigquery.BigQuerySyncTool \
   gs://xxx/hudi-gcp-bundle-0.11.0-SNAPSHOT-jar-with-dependencies.jar \
   --project-id myproject \
   --dataset-name mydataset \
   --dataset-location <location> \
   --table-name foobar \
   --source-uri gs://foo/bar/dt=* \
   --source-uri-prefix gs://foo/bar/ \
   --base-path gs://foo/bar \
   --partitioned-by dt
   ```
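   The URI flags above all derive from the table's base path and its partition field. A minimal sketch of that relationship, assuming a single hive-style partition column; the shell variable names are illustrative only and not part of the tool:

   ```shell
   # Illustrative variables (assumptions, not tool options).
   BASE_PATH="gs://foo/bar"      # value passed to --base-path
   PARTITION_FIELD="dt"          # value passed to --partitioned-by

   # Hive-style partition directories look like <base>/dt=<value>,
   # so the source URI globs over every value of the partition field.
   SOURCE_URI="${BASE_PATH}/${PARTITION_FIELD}=*"
   # The prefix is the base path with a trailing slash.
   SOURCE_URI_PREFIX="${BASE_PATH}/"

   # Print the derived flags; everything is quoted so the shell does not expand the glob.
   printf '%s\n' "--source-uri ${SOURCE_URI}"
   printf '%s\n' "--source-uri-prefix ${SOURCE_URI_PREFIX}"
   ```

   For the table in this setup, the derived values match the ones passed to `spark-submit` above.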
   - Confirm from the logs that the job completed
   ```
   22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: Manifest External table created.
   22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Manifest table creation complete for 20220402t081216_manifest
   22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: External table created using hivepartitioningoptions
   22/04/02 19:42:59 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Versions table creation complete for 20220402t081216_versions
   22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.HoodieBigQuerySyncClient: View created successfully
   22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Snapshot view creation complete for 20220402t081216
   22/04/02 19:43:01 INFO org.apache.hudi.gcp.bigquery.BigQuerySyncTool: Sync table complete for 20220402t081216
   ```
   - Check the tables created in BigQuery. There should be 2 tables (with suffixes `manifest` and `versions`) and 1 view. Query the view to read the Hudi table. Until https://issues.apache.org/jira/browse/HUDI-3290 lands, manually delete the `.hoodie_partition_metadata` files as a workaround to see the results.
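   One possible way to apply the `.hoodie_partition_metadata` workaround is with `gsutil`. This is a sketch, assuming the table lives at the same base path used above and that `gsutil` is available on the machine; the variable names are illustrative. It only prints the commands, so the listing can be reviewed before anything is deleted:

   ```shell
   # Assumption: same base path as the --base-path argument above.
   BASE_PATH="gs://foo/bar"
   # gsutil's ** wildcard spans directory levels, so this covers every partition directory.
   METADATA_GLOB="${BASE_PATH}/**/.hoodie_partition_metadata"

   # Dry run: print the commands rather than executing them.
   # Run the "ls" first, check the matches, then run the "rm" by hand.
   printf '%s\n' "gsutil ls '${METADATA_GLOB}'"
   printf '%s\n' "gsutil -m rm '${METADATA_GLOB}'"
   ```

   The deletion modifies the table's files, so it is best limited to a throwaway test table like the one in this setup.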

