You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by GitBox <gi...@apache.org> on 2020/03/24 09:36:38 UTC

[GitHub] [beam] ihji opened a new pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

ihji opened a new pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205
 
 
   
   ------------------------
   
   Thank you for your contribution! Follow this checklist to help us incorporate your contribution quickly and easily:
   
    - [ ] [**Choose reviewer(s)**](https://beam.apache.org/contribute/#make-your-change) and mention them in a comment (`R: @username`).
    - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
    - [ ] Update `CHANGES.md` with noteworthy changes.
    - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
   
   See the [Contributor Guide](https://beam.apache.org/contribute) for more tips on [how to make review process smoother](https://beam.apache.org/contribute/#make-reviewers-job-easier).
   
   Post-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   Lang | SDK | Apex | Dataflow | Flink | Gearpump | Samza | Spark
   --- | --- | --- | --- | --- | --- | --- | ---
   Go | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Go_VR_Spark/lastCompletedBuild/)
   Java | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Apex/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Dataflow_Java11/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Flink/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Flink_Streaming/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Gearpump/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Samza/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_Spark/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_PVR_Spark_Batch/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Java_ValidatesRunner_SparkStructuredStreaming/lastCompletedBuild/)
   Python | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python36/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python37/lastCompletedBuild/) | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_VR_Dataflow_V2/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Py_ValCont/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python2_PVR_Flink_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python35_VR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_Python_VR_Spark/lastCompletedBuild/)
   XLang | --- | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Flink/lastCompletedBuild/) | --- | --- | [![Build Status](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PostCommit_XVR_Spark/lastCompletedBuild/)
   
   Pre-Commit Tests Status (on master branch)
   ------------------------------------------------------------------------------------------------
   
   --- |Java | Python | Go | Website
   --- | --- | --- | --- | ---
   Non-portable | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Java_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Python_Cron/lastCompletedBuild/)<br>[![Build Status](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_PythonLint_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Go_Cron/lastCompletedBuild/) | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Website_Cron/lastCompletedBuild/) 
   Portable | --- | [![Build Status](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/badge/icon)](https://builds.apache.org/job/beam_PreCommit_Portable_Python_Cron/lastCompletedBuild/) | --- | ---
   
   See [.test-infra/jenkins/README](https://github.com/apache/beam/blob/master/.test-infra/jenkins/README.md) for trigger phrase, status and link of all Jenkins jobs.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403405278
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -214,24 +220,90 @@ public static Environment createProcessEnvironment(
       pathsToStage.addAll(stagingFiles);
     }
 
-    ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableList.Builder<Supplier<ArtifactInformation>> lazyArtifactsBuilder =
+        ImmutableList.builder();
     for (String path : pathsToStage) {
 
 Review comment:
   Don't we want this for loop to be lazy? 
   
   Rather than introducing intermediate streams of Suppliers, I think we could just rename the existing `getArtifacts()` something like `getNonDeferredArtifacts()` and then call it during resolution. 
   
   ```
   if (key.equals(deferredArtifactPayload.getKey())) {
         return getNonDeferredArtifacts(options);
   }
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-608055038
 
 
   > Friendly question: is this PR close? I'm still having to trigger Java precommits 3-4 times in order to get a green run, partially due to these timeouts.
   
   We have agreed on the solution and this PR represents an early version of it so I would say that we are close.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-608803659
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] mxm commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
mxm commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610480293
 
 
   Since this continues to affect many Beam developers with test timeouts, do you think we could commit an intermediate fix for this? Or perhaps revert the experimental feature until a fix has been developed?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402684430
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
 
 Review comment:
   done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r405136418
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -207,31 +210,94 @@ public static Environment createProcessEnvironment(
     }
   }
 
-  public static Collection<ArtifactInformation> getArtifacts(PipelineOptions options) {
-    Set<String> pathsToStage = Sets.newHashSet();
-    List<String> stagingFiles = options.as(PortablePipelineOptions.class).getFilesToStage();
-    if (stagingFiles != null) {
-      pathsToStage.addAll(stagingFiles);
-    }
-
-    ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+  private static List<ArtifactInformation> getArtifacts(List<String> stagingFiles) {
+    Set<String> pathsToStage = Sets.newHashSet(stagingFiles);
+    ImmutableList.Builder<ArtifactInformation> artifactsBuilder = ImmutableList.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        ArtifactInformation.Builder artifactBuilder = ArtifactInformation.newBuilder();
+        artifactBuilder.setTypeUrn(BeamUrns.getUrn(StandardArtifacts.Types.FILE));
+        artifactBuilder.setRoleUrn(BeamUrns.getUrn(StandardArtifacts.Roles.STAGING_TO));
+        artifactBuilder.setRolePayload(
+            RunnerApi.ArtifactStagingToRolePayload.newBuilder()
+                .setStagedName(createStagingFileName(file))
+                .build()
+                .toByteString());
         if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
+          File zippedFile;
+          HashCode hashCode;
           try {
-            filesToStage.add(createArtifactInformation(zipDirectory(file)));
+            zippedFile = zipDirectory(file);
+            hashCode = Files.asByteSource(zippedFile).hash(Hashing.sha256());
           } catch (IOException e) {
             throw new RuntimeException(e);
           }
+          artifactsBuilder.add(
+              artifactBuilder
+                  .setTypePayload(
+                      RunnerApi.ArtifactFilePayload.newBuilder()
+                          .setPath(zippedFile.getPath())
+                          .setSha256(hashCode.toString())
+                          .build()
+                          .toByteString())
+                  .build());
         } else {
-          filesToStage.add(createArtifactInformation(file));
+          HashCode hashCode;
+          try {
+            hashCode = Files.asByteSource(file).hash(Hashing.sha256());
+          } catch (IOException e) {
+            throw new RuntimeException(e);
 
 Review comment:
   Or would it be better to let the method throw an IOException? 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-603297142
 
 
   Run Java PreCommit

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403403332
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists;
+
+/**
+ * A default artifact resolver. This resolver applies {@link ResolutionFn} in the reversed order
+ * they registered i.e. the function registered later overrides the earlier one if they resolve the
+ * same artifact.
+ */
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static final ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private List<ResolutionFn> fns =
+      Lists.newArrayList(
+          (info) -> {
+            if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+              return ImmutableList.of(info);
+            } else {
+              return ImmutableList.of();
 
 Review comment:
   Is the empty list special? In particular sometimes a deferred artifact may resolve to nothing, which is different than not being able to be resolved... I think we still need optional or null or an exception to denote unresolveable by this resolver. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402699096
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,15 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A unique string identifier assigned by the creator of this payload. The creator may use this key to confirm
+  // whether they can parse the data.
+  string key = 1;
+
+  // A data for deferred artifacts. Interpretation of bytes is delegated to the creator of this payload.
 
 Review comment:
   ```suggestion
     // Data for deferred artifacts. Interpretation of bytes is delegated to the creator of this payload.
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402684405
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
 
 Review comment:
   done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb merged pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb merged pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403402959
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   As discussed (but putting here for the record) having a proxy artifact type could solve this issue. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402479848
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
 
 Review comment:
   Class comment

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402684484
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private ResolutionFn resolver =
+      (info) -> {
+        if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+          return Optional.of(info);
+        } else {
+          return Optional.empty();
+        }
+      };
+
+  @Override
+  public void register(ResolutionFn fn) {
+    resolver =
+        (info) -> {
+          Optional<RunnerApi.ArtifactInformation> resolved = fn.resolve(info);
+          if (resolved.isPresent()) {
+            return resolved;
+          } else {
+            return resolver.resolve(info);
+          }
+        };
+  }
+
+  @Override
+  public RunnerApi.Pipeline resolveArtifacts(RunnerApi.Pipeline pipeline) {
 
 Review comment:
   done :smile:

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-607536354
 
 
   CC: @chamikaramj 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402699398
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,93 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.List;
+import java.util.Map;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap;
+
+/**
+ * A default artifact resolver. This resolver applies {@link ResolutionFn} first matched in the
+ * order they registered.
+ */
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static final ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private ResolutionFn resolver =
+      (info) -> {
+        if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+          return ImmutableList.of(info);
+        } else {
+          return ImmutableList.of();
+        }
+      };
+
+  @Override
+  public void register(ResolutionFn fn) {
+    resolver =
+        (info) -> {
+          List<RunnerApi.ArtifactInformation> resolved = fn.resolve(info);
+          if (!resolved.isEmpty()) {
+            return resolved;
+          } else {
+            return resolver.resolve(info);
+          }
+        };
+  }
+
+  @Override
+  public RunnerApi.Pipeline resolveArtifacts(RunnerApi.Pipeline pipeline) {
+    ImmutableMap.Builder<String, RunnerApi.Environment> environmentMapBuilder =
+        ImmutableMap.builder();
+    for (Map.Entry<String, RunnerApi.Environment> entry :
+        pipeline.getComponents().getEnvironmentsMap().entrySet()) {
+      List<RunnerApi.ArtifactInformation> resolvedDependencies =
+          entry
+              .getValue()
+              .getDependenciesList()
+              .parallelStream()
+              .flatMap(
+                  (info) -> {
+                    List<RunnerApi.ArtifactInformation> resolved = resolver.resolve(info);
+                    if (resolved.isEmpty()) {
+                      throw new RuntimeException(
+                          String.format("cannot resolve artifact information: %s", info));
 
 Review comment:
   ```suggestion
                             String.format("Cannot resolve artifact information: %s", info));
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-604028675
 
 
   directory type and unzip_to role added. didn't create a interface for resolving artifacts since we only have one (resolving directory) now. maybe refactor later for supporting more resolvers.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402697023
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,15 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A unique string identifier assigned by the creator of this payload. The creator may use this key to confirm
+  // whether they can parse the data.
+  string key = 1;
+
+  // A data for deferred artifacts. Interpretation of bytes is delegated to the creator of this payload.
+  bytes data = 2;
 
 Review comment:
   ```suggestion
     bytes payload = 2;
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-608857763
 
 
   @lukecwik Ready to merge. Please trigger the test and take a final look.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402499335
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
+  void register(ResolutionFn fn);
+
+  RunnerApi.Pipeline resolveArtifacts(RunnerApi.Pipeline pipeline);
+
+  interface ResolutionFn {
+    Optional<RunnerApi.ArtifactInformation> resolve(RunnerApi.ArtifactInformation info);
 
 Review comment:
   As with the resolution API, one may want to attempt to resolve multiple artifacts (e.g. maven dependencies) simultaneously. One may also need to return multiple artifacts as the resolution of a single artifact (e.g. the deferred "ambient environment" one). 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610058685
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] iemejia commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
iemejia commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-603165704
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402685778
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   Don't we need at least a key field (or urn or identifier) which can be used to check whether the payload is parsable by the creator? Otherwise, it would be pretty hard to know where the bytes payload originally came from.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403407082
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,15 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A unique string identifier assigned by the creator of this payload. The creator may use this key to confirm
+  // whether they can parse the data.
+  string key = 1;
 
 Review comment:
   This is going to have to get revamped for XLang and since it isn't being exported outside of the SDK for portable runners we can easily change it.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-603297904
 
 
   Should we add a `directory` option and role that would be to `unzip`?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r400458940
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1146,33 +1146,41 @@ message StandardArtifacts {
   enum Types {
     // A URN for locally-accessible artifact files.
     // payload: ArtifactFilePayload
-    FILE     = 0 [(beam_urn) = "beam:artifact:type:file:v1"];
+    FILE      = 0 [(beam_urn) = "beam:artifact:type:file:v1"];
 
     // A URN for artifacts described by URLs.
     // payload: ArtifactUrlPayload
-    URL      = 1 [(beam_urn) = "beam:artifact:type:url:v1"];
+    URL       = 1 [(beam_urn) = "beam:artifact:type:url:v1"];
 
     // A URN for artifacts embedded in ArtifactInformation proto.
     // payload: EmbeddedFilePayload.
-    EMBEDDED = 2 [(beam_urn) = "beam:artifact:type:embedded:v1"];
+    EMBEDDED  = 2 [(beam_urn) = "beam:artifact:type:embedded:v1"];
 
     // A URN for Python artifacts hosted on PYPI.
     // payload: PypiPayload
-    PYPI     = 3 [(beam_urn) = "beam:artifact:type:pypi:v1"];
+    PYPI      = 3 [(beam_urn) = "beam:artifact:type:pypi:v1"];
 
     // A URN for Java artifacts hosted on a Maven repository.
     // payload: MavenPayload
-    MAVEN    = 4 [(beam_urn) = "beam:artifact:type:maven:v1"];
+    MAVEN     = 4 [(beam_urn) = "beam:artifact:type:maven:v1"];
+
+    // A URN for locally-accessible artifact directory.
+    // payload: ArtifactDirectoryPayload
+    DIRECTORY = 5 [(beam_urn) = "beam:artifact:type:directory:v1"];
   }
   enum Roles {
     // A URN for staging-to role.
     // payload: ArtifactStagingToRolePayload
     STAGING_TO  = 0 [(beam_urn) = "beam:artifact:role:staging_to:v1"];
+
+    // A URN for unzip-to role.
+    // payload: ArtifactUnzipToRolePayload
+    UNZIP_TO    = 1 [(beam_urn) = "beam:artifact:role:unzip_to:v1"];
   }
 }
 
 message ArtifactFilePayload {
-  // a string for an artifact path e.g. "/tmp/foo.jar"
+  // a string for an artifact file path e.g. "/tmp/foo.jar"
   string path = 1;
 
   // The hex-encoded sha256 checksum of the artifact.
 
 Review comment:
   It doesn't seem like we are populating sha256 here.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-608952454
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-603394246
 
 
   In this case, these are directories to be added to the classpath, so it may be preferable to keep them zipped, though such a role does make sense. (This is also an example of a new "directory" type that could be "resolved" to a zipfile.)

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402476502
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
+  void register(ResolutionFn fn);
 
 Review comment:
   We won't need to rely on using Optional if we make registration take a URN and ResolutionFn and then the resolver can be found by URN.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] iemejia removed a comment on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
iemejia removed a comment on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-603165704
 
 
   retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r405136162
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -207,31 +210,94 @@ public static Environment createProcessEnvironment(
     }
   }
 
-  public static Collection<ArtifactInformation> getArtifacts(PipelineOptions options) {
-    Set<String> pathsToStage = Sets.newHashSet();
-    List<String> stagingFiles = options.as(PortablePipelineOptions.class).getFilesToStage();
-    if (stagingFiles != null) {
-      pathsToStage.addAll(stagingFiles);
-    }
-
-    ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+  private static List<ArtifactInformation> getArtifacts(List<String> stagingFiles) {
+    Set<String> pathsToStage = Sets.newHashSet(stagingFiles);
+    ImmutableList.Builder<ArtifactInformation> artifactsBuilder = ImmutableList.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        ArtifactInformation.Builder artifactBuilder = ArtifactInformation.newBuilder();
+        artifactBuilder.setTypeUrn(BeamUrns.getUrn(StandardArtifacts.Types.FILE));
+        artifactBuilder.setRoleUrn(BeamUrns.getUrn(StandardArtifacts.Roles.STAGING_TO));
+        artifactBuilder.setRolePayload(
+            RunnerApi.ArtifactStagingToRolePayload.newBuilder()
+                .setStagedName(createStagingFileName(file))
+                .build()
+                .toByteString());
         if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
+          File zippedFile;
 
 Review comment:
   Nit, there seems to be a fair amount of duplication between these two case. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610530129
 
 
   Retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402688050
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
+  void register(ResolutionFn fn);
+
+  RunnerApi.Pipeline resolveArtifacts(RunnerApi.Pipeline pipeline);
+
+  interface ResolutionFn {
+    Optional<RunnerApi.ArtifactInformation> resolve(RunnerApi.ArtifactInformation info);
 
 Review comment:
   changed to List.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] reuvenlax commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
reuvenlax commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-608034034
 
 
   Friendly question: is this PR close? I'm still having to trigger Java precommits 3-4 times in order to get a green run, partially due to these timeouts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r404412636
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,92 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.List;
+import java.util.Map;
+import java.util.function.Function;
+import java.util.stream.Collectors;
+import java.util.stream.Stream;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableList;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.ImmutableMap;
+import org.apache.beam.vendor.guava.v26_0_jre.com.google.common.collect.Lists;
+
+/**
+ * A default artifact resolver. This resolver applies {@link ResolutionFn} in the reversed order
+ * they registered i.e. the function registered later overrides the earlier one if they resolve the
+ * same artifact.
+ */
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static final ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private List<ResolutionFn> fns =
+      Lists.newArrayList(
+          (info) -> {
+            if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+              return ImmutableList.of(info);
+            } else {
+              return ImmutableList.of();
 
 Review comment:
   Done. Optional list makes three choices: Failure, Success with empty output and Success with a list of artifacts. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403403766
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,15 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A unique string identifier assigned by the creator of this payload. The creator may use this key to confirm
+  // whether they can parse the data.
+  string key = 1;
 
 Review comment:
   Should this be uid? Any collisions here could be bad...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402700601
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -214,24 +220,87 @@ public static Environment createProcessEnvironment(
       pathsToStage.addAll(stagingFiles);
     }
 
-    ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableList.Builder<Supplier<ArtifactInformation>> lazyArtifactsBuilder =
+        ImmutableList.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
-        if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
-          try {
-            filesToStage.add(createArtifactInformation(zipDirectory(file)));
-          } catch (IOException e) {
-            throw new RuntimeException(e);
-          }
-        } else {
-          filesToStage.add(createArtifactInformation(file));
-        }
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        ArtifactInformation.Builder artifactBuilder = ArtifactInformation.newBuilder();
+        artifactBuilder.setTypeUrn(BeamUrns.getUrn(StandardArtifacts.Types.FILE));
+        artifactBuilder.setRoleUrn(BeamUrns.getUrn(StandardArtifacts.Roles.STAGING_TO));
+        artifactBuilder.setRolePayload(
+            RunnerApi.ArtifactStagingToRolePayload.newBuilder()
+                .setStagedName(createStagingFileName(file))
+                .build()
+                .toByteString());
+        lazyArtifactsBuilder.add(
+            file.isDirectory()
+                ? () -> {
+                  File zippedFile;
+                  HashCode hashCode;
+                  try {
+                    zippedFile = zipDirectory(file);
+                    hashCode = Files.asByteSource(zippedFile).hash(Hashing.sha256());
+                  } catch (IOException e) {
+                    throw new RuntimeException(e);
+                  }
+                  return artifactBuilder
+                      .setTypePayload(
+                          RunnerApi.ArtifactFilePayload.newBuilder()
+                              .setPath(zippedFile.getPath())
+                              .setSha256(hashCode.toString())
+                              .build()
+                              .toByteString())
+                      .build();
+                }
+                : () -> {
+                  HashCode hashCode;
+                  try {
+                    hashCode = Files.asByteSource(file).hash(Hashing.sha256());
+                  } catch (IOException e) {
+                    throw new RuntimeException(e);
+                  }
+                  return artifactBuilder
+                      .setTypePayload(
+                          RunnerApi.ArtifactFilePayload.newBuilder()
+                              .setPath(file.getPath())
+                              .setSha256(hashCode.toString())
+                              .build()
+                              .toByteString())
+                      .build();
+                });
       }
     }
-    return filesToStage.build();
+
+    List<Supplier<ArtifactInformation>> lazyArtifacts = lazyArtifactsBuilder.build();
+    String id = UUID.randomUUID().toString();
+    DefaultArtifactResolver.INSTANCE.register(
+        (info) -> {
+          if (BeamUrns.getUrn(StandardArtifacts.Types.DEFERRED).equals(info.getTypeUrn())) {
+            RunnerApi.DeferredArtifactPayload deferredArtifactPayload;
+            try {
+              deferredArtifactPayload =
+                  RunnerApi.DeferredArtifactPayload.parseFrom(info.getTypePayload());
+            } catch (InvalidProtocolBufferException e) {
+              throw new RuntimeException("Error parsing deferred artifact payload.", e);
+            }
+            if (id.equals(deferredArtifactPayload.getKey())) {
+              return lazyArtifacts.stream().map(Supplier::get).collect(Collectors.toList());
+            } else {
+              return ImmutableList.of();
+            }
+          } else {
+            return ImmutableList.of();
+          }
+        });
+
+    return ImmutableList.of(
+        ArtifactInformation.newBuilder()
+            .setTypeUrn(BeamUrns.getUrn(StandardArtifacts.Types.DEFERRED))
+            .setTypePayload(
+                RunnerApi.DeferredArtifactPayload.newBuilder().setKey(id).build().toByteString())
 
 Review comment:
   ```suggestion
       String key = UUID.randomUUID().toString();
       DefaultArtifactResolver.INSTANCE.register(
           (info) -> {
             if (BeamUrns.getUrn(StandardArtifacts.Types.DEFERRED).equals(info.getTypeUrn())) {
               RunnerApi.DeferredArtifactPayload deferredArtifactPayload;
               try {
                 deferredArtifactPayload =
                     RunnerApi.DeferredArtifactPayload.parseFrom(info.getTypePayload());
               } catch (InvalidProtocolBufferException e) {
                 throw new RuntimeException("Error parsing deferred artifact payload.", e);
               }
               if (key.equals(deferredArtifactPayload.getKey())) {
                 return lazyArtifacts.stream().map(Supplier::get).collect(Collectors.toList());
               } else {
                 return ImmutableList.of();
               }
             } else {
               return ImmutableList.of();
             }
           });
   
       return ImmutableList.of(
           ArtifactInformation.newBuilder()
               .setTypeUrn(BeamUrns.getUrn(StandardArtifacts.Types.DEFERRED))
               .setTypePayload(
                   RunnerApi.DeferredArtifactPayload.newBuilder().setKey(key).build().toByteString())
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402697023
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,15 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A unique string identifier assigned by the creator of this payload. The creator may use this key to confirm
+  // whether they can parse the data.
+  string key = 1;
+
+  // A data for deferred artifacts. Interpretation of bytes is delegated to the creator of this payload.
+  bytes data = 2;
 
 Review comment:
   ```suggestion
     bytes payload = 2;
   ```

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610611127
 
 
   @robertwb All comments are addressed. PTAL!

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r401930784
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1146,33 +1146,41 @@ message StandardArtifacts {
   enum Types {
     // A URN for locally-accessible artifact files.
     // payload: ArtifactFilePayload
-    FILE     = 0 [(beam_urn) = "beam:artifact:type:file:v1"];
+    FILE      = 0 [(beam_urn) = "beam:artifact:type:file:v1"];
 
     // A URN for artifacts described by URLs.
     // payload: ArtifactUrlPayload
-    URL      = 1 [(beam_urn) = "beam:artifact:type:url:v1"];
+    URL       = 1 [(beam_urn) = "beam:artifact:type:url:v1"];
 
     // A URN for artifacts embedded in ArtifactInformation proto.
     // payload: EmbeddedFilePayload.
-    EMBEDDED = 2 [(beam_urn) = "beam:artifact:type:embedded:v1"];
+    EMBEDDED  = 2 [(beam_urn) = "beam:artifact:type:embedded:v1"];
 
     // A URN for Python artifacts hosted on PYPI.
     // payload: PypiPayload
-    PYPI     = 3 [(beam_urn) = "beam:artifact:type:pypi:v1"];
+    PYPI      = 3 [(beam_urn) = "beam:artifact:type:pypi:v1"];
 
     // A URN for Java artifacts hosted on a Maven repository.
     // payload: MavenPayload
-    MAVEN    = 4 [(beam_urn) = "beam:artifact:type:maven:v1"];
+    MAVEN     = 4 [(beam_urn) = "beam:artifact:type:maven:v1"];
+
+    // A URN for locally-accessible artifact directory.
+    // payload: ArtifactDirectoryPayload
+    DIRECTORY = 5 [(beam_urn) = "beam:artifact:type:directory:v1"];
   }
   enum Roles {
     // A URN for staging-to role.
     // payload: ArtifactStagingToRolePayload
     STAGING_TO  = 0 [(beam_urn) = "beam:artifact:role:staging_to:v1"];
+
+    // A URN for unzip-to role.
+    // payload: ArtifactUnzipToRolePayload
+    UNZIP_TO    = 1 [(beam_urn) = "beam:artifact:role:unzip_to:v1"];
   }
 }
 
 message ArtifactFilePayload {
-  // a string for an artifact path e.g. "/tmp/foo.jar"
+  // a string for an artifact file path e.g. "/tmp/foo.jar"
   string path = 1;
 
   // The hex-encoded sha256 checksum of the artifact.
 
 Review comment:
   adding deferred artifacts and populating sha256.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402494768
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
+  void register(ResolutionFn fn);
 
 Review comment:
   I agree. However, this might be more limited. E.g. one could have a more than one resolver per type, each of which can only resolve a subset, or alternatively one could want to resolve more than one type (or any type), e.g. a proxying resolver. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402698752
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   Thats a good point about the key and in general this will become a problem for all artifacts since none of them have unique keys associated with them unless an intermediary resolves the artifact immediately and possibly "renames" the contents to make it unique.
   
   If we ever want to support multiple layers of expansion for XLang we'll want to proxy any artifact resolution/retrieval calls through the layers.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402684550
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -215,22 +221,76 @@ public static Environment createProcessEnvironment(
     }
 
     ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableMap.Builder<String, Function<ArtifactInformation, ArtifactInformation>>
+        lazyArtifactsBuilder = ImmutableMap.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
-        if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
-          try {
-            filesToStage.add(createArtifactInformation(zipDirectory(file)));
-          } catch (IOException e) {
-            throw new RuntimeException(e);
-          }
-        } else {
-          filesToStage.add(createArtifactInformation(file));
-        }
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        String id = UUID.randomUUID().toString();
 
 Review comment:
   done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610530293
 
 
   Retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402501898
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -215,22 +221,76 @@ public static Environment createProcessEnvironment(
     }
 
     ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableMap.Builder<String, Function<ArtifactInformation, ArtifactInformation>>
+        lazyArtifactsBuilder = ImmutableMap.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
-        if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
-          try {
-            filesToStage.add(createArtifactInformation(zipDirectory(file)));
-          } catch (IOException e) {
-            throw new RuntimeException(e);
-          }
-        } else {
-          filesToStage.add(createArtifactInformation(file));
-        }
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        String id = UUID.randomUUID().toString();
 
 Review comment:
   Yes, I think the goal is to avoid even this enumeration until we actually need it. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402492044
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   +1

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r403390422
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private ResolutionFn resolver =
+      (info) -> {
+        if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+          return Optional.of(info);
+        } else {
+          return Optional.empty();
+        }
+      };
+
+  @Override
+  public void register(ResolutionFn fn) {
+    resolver =
 
 Review comment:
   done.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402482918
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -215,22 +221,76 @@ public static Environment createProcessEnvironment(
     }
 
     ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableMap.Builder<String, Function<ArtifactInformation, ArtifactInformation>>
+        lazyArtifactsBuilder = ImmutableMap.builder();
     for (String path : pathsToStage) {
       File file = new File(path);
-      if (new File(path).exists()) {
-        // Spurious items get added to the classpath. Filter by just those that exist.
-        if (file.isDirectory()) {
-          // Zip up directories so we can upload them to the artifact service.
-          try {
-            filesToStage.add(createArtifactInformation(zipDirectory(file)));
-          } catch (IOException e) {
-            throw new RuntimeException(e);
-          }
-        } else {
-          filesToStage.add(createArtifactInformation(file));
-        }
+      // Spurious items get added to the classpath. Filter by just those that exist.
+      if (file.exists()) {
+        String id = UUID.randomUUID().toString();
 
 Review comment:
   Note that you could put one ID into the map for the entire list of files if you allowed ResolutionFn to return a List/Collection of artifacts.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-604654049
 
 
   R: @lukecwik

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r404411951
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/Environments.java
 ##########
 @@ -214,24 +220,90 @@ public static Environment createProcessEnvironment(
       pathsToStage.addAll(stagingFiles);
     }
 
-    ImmutableList.Builder<ArtifactInformation> filesToStage = ImmutableList.builder();
+    ImmutableList.Builder<Supplier<ArtifactInformation>> lazyArtifactsBuilder =
+        ImmutableList.builder();
     for (String path : pathsToStage) {
 
 Review comment:
   This for loop is fairly cheap and from stream of Suppliers we can easily get additional performance benefits by creating parallelStream. When we consider parallelizing expensive computations, some boilerplate codes are needed anyway in `getNonDeferredArtifacts()`. I think building a stream is a nice way to abstract them out.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
chamikaramj commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-610537944
 
 
   Retest this please

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
robertwb commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402500648
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private ResolutionFn resolver =
+      (info) -> {
+        if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+          return Optional.of(info);
+        } else {
+          return Optional.empty();
+        }
+      };
+
+  @Override
+  public void register(ResolutionFn fn) {
+    resolver =
 
 Review comment:
   I wonder if having an explicit List<ResolutionFn> would be easier to understand than the implicit chaining in these abstract classes. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402474608
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   I was under the impression we were going to make this a `bytes` field so that any deferred information can get passed through and then back to the **creator** whether it be an id or a serialized blob of objects or ...
   
   Allowing for `bytes` enables for solutions beyond in memory maps.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402477389
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/ArtifactResolver.java
 ##########
 @@ -0,0 +1,31 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Optional;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public interface ArtifactResolver {
 
 Review comment:
   Please add comments to this class and methods.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402482365
 
 

 ##########
 File path: runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction/DefaultArtifactResolver.java
 ##########
 @@ -0,0 +1,83 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.runners.core.construction;
+
+import java.util.Map;
+import java.util.Optional;
+import java.util.stream.Collectors;
+import org.apache.beam.model.pipeline.v1.RunnerApi;
+
+public class DefaultArtifactResolver implements ArtifactResolver {
+  public static ArtifactResolver INSTANCE = new DefaultArtifactResolver();
+
+  private ResolutionFn resolver =
+      (info) -> {
+        if (BeamUrns.getUrn(RunnerApi.StandardArtifacts.Types.FILE).equals(info.getTypeUrn())) {
+          return Optional.of(info);
+        } else {
+          return Optional.empty();
+        }
+      };
+
+  @Override
+  public void register(ResolutionFn fn) {
+    resolver =
+        (info) -> {
+          Optional<RunnerApi.ArtifactInformation> resolved = fn.resolve(info);
+          if (resolved.isPresent()) {
+            return resolved;
+          } else {
+            return resolver.resolve(info);
+          }
+        };
+  }
+
+  @Override
+  public RunnerApi.Pipeline resolveArtifacts(RunnerApi.Pipeline pipeline) {
 
 Review comment:
   The level of nesting in this method is getting a little silly.
   
   Use local variables to logically describe what your doing and consider dropping using stream

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
lukecwik commented on a change in pull request #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#discussion_r402698752
 
 

 ##########
 File path: model/pipeline/src/main/proto/beam_runner_api.proto
 ##########
 @@ -1206,6 +1210,11 @@ message MavenPayload {
   string repository_url = 2;
 }
 
+message DeferredArtifactPayload {
+  // A id for deferred artifacts.
+  string id = 1;
 
 Review comment:
   Thats a good point about the key and in general this will become a problem for all artifacts since none of them have unique keys associated with them unless an intermediary resolves the artifact immediately and possibly "renames" the contents to make it unique.
   
   If we ever want to support multiple layers of expansion for XLang we'll want to proxy any artifact resolution/retrieval calls through the layers and not require each layer to have a *copy* of the artifact.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [beam] ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java

Posted by GitBox <gi...@apache.org>.
ihji commented on issue #11205: [BEAM-9578] Enumerating artifacts is too expensive in Java
URL: https://github.com/apache/beam/pull/11205#issuecomment-607534661
 
 
   CC: @robertwb 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services