Posted to commits@hudi.apache.org by "codope (via GitHub)" <gi...@apache.org> on 2023/02/13 07:42:27 UTC

[GitHub] [hudi] codope opened a new pull request, #7929: [DOCS] [WIP] Add new sources to deltastreamer docs

codope opened a new pull request, #7929:
URL: https://github.com/apache/hudi/pull/7929

   ### Change Logs
   
   _Describe context and summary for this change. Highlight if any code was copied._
   
   ### Impact
   
   _Describe any public API or user-facing feature change or any performance impact._
   
   ### Risk level (write none, low medium or high below)
   
   _If medium or high, explain what verification was done to mitigate the risks._
   
   ### Documentation Update
   
   _Describe any necessary documentation update if there is any new feature, config, or user-facing change_
   
   - _The config description must be updated if new configs are added or the default value of the configs are changed_
   - _Any new feature or user-facing change requires updating the Hudi website. Please create a Jira ticket, attach the
     ticket number here and follow the [instruction](https://hudi.apache.org/contribute/developer-setup#website) to make
     changes to the website._
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscribe@hudi.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [hudi] codope commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1106853958


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.

Review Comment:
   I don't think we need to link the whole document. We typically assume that users know how to enable event notifications. What we can add here are two spark-submit command samples, one for each of the two sources.





[GitHub] [hudi] codope commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1105412713


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pubsub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pubsub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pubsub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pubsub Notifications for the bucket. Follow Google’s documentation here: [https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pubsub subscription corresponding to the topic
+3. Note the GCS Project Id, the GCS Subscription Id and use them for the following Hoodie configurations:
+   1. hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID
+   2. hoodie.deltastreamer.source.gcs.subscription.id=SUSBCRIPTION_ID
+   3. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as
+      `org.apache.hudi.utilities.sources.GcsEventsSource` and hoodie.deltastreamer.source.cloud.meta.ack=true, and path related
+      configs as described in the detailed guide mentiond above.
+4. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as

Review Comment:
   Fixed. Thanks for pointing it out.



##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pubsub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pubsub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pubsub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pubsub Notifications for the bucket. Follow Google’s documentation here: [https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pubsub subscription corresponding to the topic
+3. Note the GCS Project Id, the GCS Subscription Id and use them for the following Hoodie configurations:
+   1. hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID
+   2. hoodie.deltastreamer.source.gcs.subscription.id=SUSBCRIPTION_ID
+   3. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as
+      `org.apache.hudi.utilities.sources.GcsEventsSource` and hoodie.deltastreamer.source.cloud.meta.ack=true, and path related
+      configs as described in the detailed guide mentiond above.
+4. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as

Review Comment:
   Fixed. Thanks for pointing it out.





[GitHub] [hudi] nsivabalan commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "nsivabalan (via GitHub)" <gi...@apache.org>.
nsivabalan commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1106032816


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pub/Sub Notifications for the bucket. Follow Google’s documentation here: [https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pub/Sub subscription corresponding to the topic
+3. Note the GCP Project Id, the Pub/Sub Subscription Id and use them for the following Hoodie configurations:
+   1. hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID
+   2. hoodie.deltastreamer.source.gcs.subscription.id=SUSBCRIPTION_ID
+   3. Start the `GcsEventsSource` using the `HoodieDeltaStreamer` utility with --source-class parameter as
+      `org.apache.hudi.utilities.sources.GcsEventsSource` and hoodie.deltastreamer.source.cloud.meta.ack=true, and path related

Review Comment:
   nit: within single quotes



##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.

Review Comment:
   We usually don't link to ad hoc Google Docs; this should be part of the RFC. Curious to know: why not just update the RFC?





[GitHub] [hudi] codope commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "codope (via GitHub)" <gi...@apache.org>.
codope commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1158026176


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.

Review Comment:
   Have added the spark-submit commands.
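
For readers following the thread, here is a rough sketch of what a spark-submit sample for `GcsEventsSource` might look like. The source class and the three `--hoodie-conf` settings come from the diff under review. The jar path, table path, table name, ordering field and project/subscription values are placeholders assumed for illustration. The script only prints the command unless `SPARK_SUBMIT` is overridden.

```shell
#!/bin/sh
# Sketch only. Prints the command by default (dry run); set
# SPARK_SUBMIT=spark-submit in the environment to actually launch it.
SPARK_SUBMIT="${SPARK_SUBMIT:-echo spark-submit}"

# Placeholder values (assumptions, not taken from the PR).
HUDI_UTILITIES_JAR="${HUDI_UTILITIES_JAR:-/path/to/hudi-utilities-bundle.jar}"
GCP_PROJECT_ID="${GCP_PROJECT_ID:-my-gcp-project}"
SUBSCRIPTION_ID="${SUBSCRIPTION_ID:-my-subscription}"
META_TABLE_PATH="${META_TABLE_PATH:-file:///tmp/hudi/gcs_meta}"

gcs_events_deltastreamer() {
  # $SPARK_SUBMIT is left unquoted on purpose so that the default
  # "echo spark-submit" splits into two words.
  $SPARK_SUBMIT \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    "$HUDI_UTILITIES_JAR" \
    --source-class org.apache.hudi.utilities.sources.GcsEventsSource \
    --table-type COPY_ON_WRITE \
    --op INSERT \
    --source-ordering-field timeCreated \
    --target-base-path "$META_TABLE_PATH" \
    --target-table gcs_meta \
    --hoodie-conf "hoodie.deltastreamer.source.gcs.project.id=$GCP_PROJECT_ID" \
    --hoodie-conf "hoodie.deltastreamer.source.gcs.subscription.id=$SUBSCRIPTION_ID" \
    --hoodie-conf "hoodie.deltastreamer.source.cloud.meta.ack=true"
}

gcs_events_deltastreamer
```

A real run would likely also need record key and partition path settings, which this thread does not spell out.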





[GitHub] [hudi] pramodbiligiri commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "pramodbiligiri (via GitHub)" <gi...@apache.org>.
pramodbiligiri commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1105341974


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pubsub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pubsub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pubsub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pubsub Notifications for the bucket. Follow Google’s documentation here: [https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pubsub subscription corresponding to the topic
+3. Note the GCS Project Id, the GCS Subscription Id and use them for the following Hoodie configurations:
+   1. hoodie.deltastreamer.source.gcs.project.id=GCP_PROJECT_ID
+   2. hoodie.deltastreamer.source.gcs.subscription.id=SUSBCRIPTION_ID
+   3. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as
+      `org.apache.hudi.utilities.sources.GcsEventsSource` and hoodie.deltastreamer.source.cloud.meta.ack=true, and path related
+      configs as described in the detailed guide mentiond above.
+4. Start the GcsEventsSource using the `HoodieDeltaStreamer` utility with --source-class parameter as

Review Comment:
   Another typo in original PR: Here it should be "Start the GcsEventsHoodieIncrSource" and not "Start the GcsEventsSource"
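
To illustrate the distinction, a second job using `GcsEventsHoodieIncrSource` might be launched roughly as sketched below. Only the class name comes from this comment. The `hoodie.deltastreamer.source.hoodieincr.path` setting, and the idea that it points at the metadata table written by the `GcsEventsSource` job, are assumptions not stated in the thread, as are all paths and table names. The script prints the command rather than running it unless `SPARK_SUBMIT` is overridden.

```shell
#!/bin/sh
# Sketch only. Dry run by default; set SPARK_SUBMIT=spark-submit to launch.
SPARK_SUBMIT="${SPARK_SUBMIT:-echo spark-submit}"

# Placeholder values (assumptions, not taken from the PR).
HUDI_UTILITIES_JAR="${HUDI_UTILITIES_JAR:-/path/to/hudi-utilities-bundle.jar}"
META_TABLE_PATH="${META_TABLE_PATH:-file:///tmp/hudi/gcs_meta}"
DATA_TABLE_PATH="${DATA_TABLE_PATH:-file:///tmp/hudi/gcs_data}"

gcs_incr_deltastreamer() {
  # Assumed: the incremental source reads the events table produced by the
  # GcsEventsSource job, located via the hoodieincr.path setting.
  $SPARK_SUBMIT \
    --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
    "$HUDI_UTILITIES_JAR" \
    --source-class org.apache.hudi.utilities.sources.GcsEventsHoodieIncrSource \
    --table-type COPY_ON_WRITE \
    --op INSERT \
    --target-base-path "$DATA_TABLE_PATH" \
    --target-table gcs_data \
    --hoodie-conf "hoodie.deltastreamer.source.hoodieincr.path=$META_TABLE_PATH"
}

gcs_incr_deltastreamer
```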



##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pubsub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pubsub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pubsub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.
+A high level overview of the same is provided below.
+1. Configure Cloud Storage Pubsub Notifications for the bucket. Follow Google’s documentation here: [https://cloud.google.com/storage/docs/reporting-changes](reporting changes)
+2. Create a Pubsub subscription corresponding to the topic
+3. Note the GCS Project Id, the GCS Subscription Id and use them for the following Hoodie configurations:

Review Comment:
   I've made a typo in the original PR: This should be "Note the GCP Project Id" and not "Note the GCS Project Id"





[GitHub] [hudi] pramodbiligiri commented on a diff in pull request #7929: [HUDI-5754] Add new sources to deltastreamer docs

Posted by "pramodbiligiri (via GitHub)" <gi...@apache.org>.
pramodbiligiri commented on code in PR #7929:
URL: https://github.com/apache/hudi/pull/7929#discussion_r1106058509


##########
website/docs/hoodie_deltastreamer.md:
##########
@@ -340,6 +388,26 @@ to trigger/processing of new or changed data as soon as it is available on S3.
 
 Insert code sample from this blog: https://hudi.apache.org/blog/2021/08/23/s3-events-source/#configuration-and-setup
 
+### GCS Events
+Google Cloud Storage (GCS) service provides an event notification mechanism which will post notifications when certain
+events happen in your GCS bucket. You can read more at [Pub/Sub Notifications](https://cloud.google.com/storage/docs/pubsub-notifications/).
+GCS will put these events in a Cloud Pub/Sub topic. Apache Hudi provides a GcsEventsSource that can read from Cloud Pub/Sub
+to trigger/processing of new or changed data as soon as it is available on GCS.
+
+#### Setup
+A detailed guide on [How to use the system](https://docs.google.com/document/d/1VfvtdvhXw6oEHPgZ_4Be2rkPxIzE0kBCNUiVDsXnSAA/edit#heading=h.tpmqk5oj0crt) is available.

Review Comment:
   This is the only publicly available reference doc for this feature. It was contributed to OSS (by me) a while after being developed. There's an older version of this doc but it is not public.


