You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2020/07/15 07:51:44 UTC

[GitHub] [beam] mwalenia opened a new pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

mwalenia opened a new pull request #12263:
URL: https://github.com/apache/beam/pull/12263


   During implementation of DLP transforms I missed the side inputs that can be passed with other arguments.
   This PR fixes my error.
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Dataflow | Flink | Samza | Spark | Twister2
   --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/) | ---
   Java | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink_Java11/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Twister2/lastCompletedBuild/)
   Python | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python38/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/) | ---
   XLang | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Direct/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/) | ---
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/)
   Portable | --- | [![Build Status](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://ci-beam.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] mwalenia commented on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
mwalenia commented on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-659921211


   @kamilwu can you take a look? The others seem to have missed the notifications and I'd like to get this to 2.23


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] mwalenia commented on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
mwalenia commented on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-658607504


   R: @santhh @tysonjh 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] tvalentyn commented on a change in pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on a change in pull request #12263:
URL: https://github.com/apache/beam/pull/12263#discussion_r468935806



##########
File path: sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/DLPDeidentifyText.java
##########
@@ -177,19 +177,24 @@ public DLPDeidentifyText build() {
   @Override
   public PCollection<KV<String, DeidentifyContentResponse>> expand(
       PCollection<KV<String, String>> input) {
-    return input
-        .apply(ParDo.of(new MapStringToDlpRow(getColumnDelimiter())))
-        .apply("Batch Contents", ParDo.of(new BatchRequestForDLP(getBatchSizeBytes())))
-        .apply(
-            "DLPDeidentify",
+
+    ParDo.SingleOutput<KV<String, Iterable<Table.Row>>, KV<String, DeidentifyContentResponse>>
+        deidentifyParDo =
             ParDo.of(
                 new DeidentifyText(
                     getProjectId(),
                     getInspectTemplateName(),
                     getDeidentifyTemplateName(),
                     getInspectConfig(),
                     getDeidentifyConfig(),
-                    getHeaderColumns())));
+                    getHeaderColumns()));
+    if (getHeaderColumns() != null) {

Review comment:
       Thanks, merging for now so that we don't have to cherry-pick this into 2.24.0 and we have manually verified the fix on 2.23.0. 
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #12263:
URL: https://github.com/apache/beam/pull/12263#discussion_r456711025



##########
File path: sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/DLPDeidentifyText.java
##########
@@ -177,19 +177,24 @@ public DLPDeidentifyText build() {
   @Override
   public PCollection<KV<String, DeidentifyContentResponse>> expand(
       PCollection<KV<String, String>> input) {
-    return input
-        .apply(ParDo.of(new MapStringToDlpRow(getColumnDelimiter())))
-        .apply("Batch Contents", ParDo.of(new BatchRequestForDLP(getBatchSizeBytes())))
-        .apply(
-            "DLPDeidentify",
+
+    ParDo.SingleOutput<KV<String, Iterable<Table.Row>>, KV<String, DeidentifyContentResponse>>
+        deidentifyParDo =
             ParDo.of(
                 new DeidentifyText(
                     getProjectId(),
                     getInspectTemplateName(),
                     getDeidentifyTemplateName(),
                     getInspectConfig(),
                     getDeidentifyConfig(),
-                    getHeaderColumns())));
+                    getHeaderColumns()));
+    if (getHeaderColumns() != null) {

Review comment:
       Do we need to add new unit tests to cover the new functionality ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] tvalentyn merged pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
tvalentyn merged pull request #12263:
URL: https://github.com/apache/beam/pull/12263


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] mwalenia commented on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
mwalenia commented on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-660005888


   @tvalentyn Do you think you can help me out? @santhh noticed a bug in the code heading into 2.23 and this PR fixes it, it would be nice to have this in current release. WDYT?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] tvalentyn commented on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-662158727


   I added the fix to the release branch. @santhh will help validate this feature on 2.23.0 RC2.
   I suggest we add necessary tests to this PR and merge it into master afterwards. Adding 2.24.0 to Fix Version set on [BEAM-10492], let's try to finalize this PR before 2.24.0 branch is cut.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] tvalentyn commented on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-671495214


   @mwalenia 2.24.0 branch cut is in a couple of days - are you still working on this change?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] mwalenia removed a comment on pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
mwalenia removed a comment on pull request #12263:
URL: https://github.com/apache/beam/pull/12263#issuecomment-659921211


   @kamilwu can you take a look? The others seem to have missed the notifications and I'd like to get this to 2.23


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] mwalenia commented on a change in pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
mwalenia commented on a change in pull request #12263:
URL: https://github.com/apache/beam/pull/12263#discussion_r457388497



##########
File path: sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/DLPDeidentifyText.java
##########
@@ -177,19 +177,24 @@ public DLPDeidentifyText build() {
   @Override
   public PCollection<KV<String, DeidentifyContentResponse>> expand(
       PCollection<KV<String, String>> input) {
-    return input
-        .apply(ParDo.of(new MapStringToDlpRow(getColumnDelimiter())))
-        .apply("Batch Contents", ParDo.of(new BatchRequestForDLP(getBatchSizeBytes())))
-        .apply(
-            "DLPDeidentify",
+
+    ParDo.SingleOutput<KV<String, Iterable<Table.Row>>, KV<String, DeidentifyContentResponse>>
+        deidentifyParDo =
             ParDo.of(
                 new DeidentifyText(
                     getProjectId(),
                     getInspectTemplateName(),
                     getDeidentifyTemplateName(),
                     getInspectConfig(),
                     getDeidentifyConfig(),
-                    getHeaderColumns())));
+                    getHeaderColumns()));
+    if (getHeaderColumns() != null) {

Review comment:
       It would be nice to add them, but I have no idea how to approach checking this. Do you have any suggestions?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] tysonjh commented on a change in pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
tysonjh commented on a change in pull request #12263:
URL: https://github.com/apache/beam/pull/12263#discussion_r457804483



##########
File path: sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/DLPDeidentifyText.java
##########
@@ -177,19 +177,24 @@ public DLPDeidentifyText build() {
   @Override
   public PCollection<KV<String, DeidentifyContentResponse>> expand(
       PCollection<KV<String, String>> input) {
-    return input
-        .apply(ParDo.of(new MapStringToDlpRow(getColumnDelimiter())))
-        .apply("Batch Contents", ParDo.of(new BatchRequestForDLP(getBatchSizeBytes())))
-        .apply(
-            "DLPDeidentify",
+
+    ParDo.SingleOutput<KV<String, Iterable<Table.Row>>, KV<String, DeidentifyContentResponse>>
+        deidentifyParDo =
             ParDo.of(
                 new DeidentifyText(
                     getProjectId(),
                     getInspectTemplateName(),
                     getDeidentifyTemplateName(),
                     getInspectConfig(),
                     getDeidentifyConfig(),
-                    getHeaderColumns())));
+                    getHeaderColumns()));
+    if (getHeaderColumns() != null) {

Review comment:
       I'm not sure of all the available tools for testing DoFns in Beam so there may be better options but one way would be to use the @NeedsRunner annotation and some PAsserts to verify the expected output ([from the wiki](https://cwiki.apache.org/confluence/display/BEAM/Contribution+Testing+Guide#ContributionTestingGuide-EffectiveuseoftheTestPipelineJUnitrule))? 
   
   It would require injecting a mock cloud client library to avoid making any calls to the live API.
   
   
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [beam] chamikaramj commented on a change in pull request #12263: [BEAM-10492] Add missing sideinput handling to DLP transforms

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on a change in pull request #12263:
URL: https://github.com/apache/beam/pull/12263#discussion_r468930529



##########
File path: sdks/java/extensions/ml/src/main/java/org/apache/beam/sdk/extensions/ml/DLPDeidentifyText.java
##########
@@ -177,19 +177,24 @@ public DLPDeidentifyText build() {
   @Override
   public PCollection<KV<String, DeidentifyContentResponse>> expand(
       PCollection<KV<String, String>> input) {
-    return input
-        .apply(ParDo.of(new MapStringToDlpRow(getColumnDelimiter())))
-        .apply("Batch Contents", ParDo.of(new BatchRequestForDLP(getBatchSizeBytes())))
-        .apply(
-            "DLPDeidentify",
+
+    ParDo.SingleOutput<KV<String, Iterable<Table.Row>>, KV<String, DeidentifyContentResponse>>
+        deidentifyParDo =
             ParDo.of(
                 new DeidentifyText(
                     getProjectId(),
                     getInspectTemplateName(),
                     getDeidentifyTemplateName(),
                     getInspectConfig(),
                     getDeidentifyConfig(),
-                    getHeaderColumns())));
+                    getHeaderColumns()));
+    if (getHeaderColumns() != null) {

Review comment:
       It was just a suggestion. I'm OK with merging as it is if that's OK with the author of the PR.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org