Posted to commits@hudi.apache.org by "Vinoth Govindarajan (Jira)" <ji...@apache.org> on 2022/04/04 03:08:00 UTC

[jira] [Comment Edited] (HUDI-2438) [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync

    [ https://issues.apache.org/jira/browse/HUDI-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516608#comment-17516608 ] 

Vinoth Govindarajan edited comment on HUDI-2438 at 4/4/22 3:07 AM:
-------------------------------------------------------------------

Hi [~gauravrai0x] & [~l0s01w3], there is a way to generate manifest files: [~joyansil] implemented a Java client for this. The details are in this ticket: https://issues.apache.org/jira/browse/HUDI-3020

 

BigQuerySyncTool is also available and will be part of the 0.11.0 release. It already invokes the manifest generation code as part of its sync method.

 

Here is how you can test it out:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hudi.metadata.ManifestFileUtil;

val conf = new Configuration();{code}
{code:java}
val basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi"
val manifestFileUtil: ManifestFileUtil = ManifestFileUtil.builder().setConf(conf).setBasePath(basePath).build();
manifestFileUtil.writeManifestFile() {code}
{code:java}
gsutil cat gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/.hoodie/manifest/latest-snapshot.csv | head -5{code}
{code:java}
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/95e78133-08dd-4721-9c49-8fbe338589f0-0_621-10-1021_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/1c01ddf1-41e8-43bf-94f3-9a75c9e39b21-0_1238-12-1638_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/a021864b-d0b6-462a-8f47-6370668993d6-0_694-12-1094_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/11815af9-6ed5-4f05-9b6b-7f79479f4f53-0_625-12-1025_20210927173651.parquet
gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi/op_cmpny_cd=WMT-US/visit_date=2020-11-01/ef4fc799-9124-4656-8106-f0374590dde6-0_1151-12-1551_20210927173651.parquet {code}
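
If you want to sanity-check the generated manifest before pointing BigQuery at it, something like the following may help. It is only a minimal sketch, assuming the manifest holds one absolute file path per line, as in the output above. ManifestCheck and invalidEntries are hypothetical names used for illustration and are not part of Hudi.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Hudi: flags manifest lines that do not
// look like Parquet data files under the table base path.
class ManifestCheck {

    // Returns every non-blank line that is outside basePath or is not a .parquet file.
    static List<String> invalidEntries(List<String> lines, String basePath) {
        // Normalize the prefix so "gs://bucket/table" does not match "gs://bucket/table2/...".
        String prefix = basePath.endsWith("/") ? basePath : basePath + "/";
        List<String> invalid = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (!line.startsWith(prefix) || !line.endsWith(".parquet")) {
                invalid.add(line);
            }
        }
        return invalid;
    }

    public static void main(String[] args) {
        String basePath = "gs://udp-hudi-storage5/store_visit_scan_bootstrap_hudi";
        // One sample line copied from the manifest output above.
        List<String> lines = Arrays.asList(
            basePath + "/op_cmpny_cd=WMT-US/visit_date=2020-11-01/"
                + "95e78133-08dd-4721-9c49-8fbe338589f0-0_621-10-1021_20210927173651.parquet");
        System.out.println("Invalid entries: " + invalidEntries(lines, basePath));
    }
}
```

In practice you would feed it the lines of latest-snapshot.csv (for example, the output of the gsutil cat command above) instead of the hard-coded sample.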



> [Umbrella] [RFC-34] Implement BigQuerySyncTool for BigQuery Sync
> ----------------------------------------------------------------
>
>                 Key: HUDI-2438
>                 URL: https://issues.apache.org/jira/browse/HUDI-2438
>             Project: Apache Hudi
>          Issue Type: Epic
>          Components: Common Core, meta-sync
>            Reporter: Vinoth Govindarajan
>            Assignee: Vinoth Govindarajan
>            Priority: Blocker
>              Labels: BigQuery, Integration, pull-request-available
>             Fix For: 0.11.0
>
>
> BigQuery is Google Cloud's fully managed, petabyte-scale, and cost-effective analytics data warehouse that lets you run analytics over vast amounts of data in near real-time. BigQuery currently [doesn’t support|https://cloud.google.com/bigquery/external-data-cloud-storage] the Apache Hudi file format, but it does support the Parquet file format. The proposal is to implement a BigQuerySync, similar to HiveSync, that syncs the Hudi table as a BigQuery external Parquet table so that users can query Hudi tables using BigQuery. Uber is already syncing some of its Hudi tables to a BigQuery data mart; this will help them write, sync, and query.
>  
> More details are in RFC-34: [https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=188745980]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)