Posted to issues@beam.apache.org by "Beam JIRA Bot (Jira)" <ji...@apache.org> on 2022/04/22 17:28:00 UTC

[jira] [Commented] (BEAM-13225) Dataflow Prime job fails when providing resource hints on a transform

    [ https://issues.apache.org/jira/browse/BEAM-13225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526585#comment-17526585 ] 

Beam JIRA Bot commented on BEAM-13225:
--------------------------------------

This issue is P2 but has been unassigned without any comment for 60 days, so it has been labeled "stale-P2". If this issue is still affecting you, we care! Please comment and remove the label. Otherwise, in 14 days the issue will be moved to P3.

Please see https://beam.apache.org/contribute/jira-priorities/ for a detailed explanation of what these priorities mean.


> Dataflow Prime job fails when providing resource hints on a transform
> ---------------------------------------------------------------------
>
>                 Key: BEAM-13225
>                 URL: https://issues.apache.org/jira/browse/BEAM-13225
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow
>    Affects Versions: 2.32.0
>            Reporter: Brent Worden
>            Priority: P2
>              Labels: stale-P2
>
> I have a classic Dataflow template written using the Apache Beam Java SDK v2.32.0.  The template simply consumes messages from a Pub/Sub subscription and writes them to Google Cloud Storage.
> The template can successfully be used to run jobs with [Dataflow Prime|https://cloud.google.com/dataflow/docs/guides/enable-dataflow-prime] experimental features enabled through {{--additional-experiments enable_prime}} and a pipeline-level resource hint provided via {{--parameters=resourceHints=min_ram=8GiB}}:
> {code}
> gcloud dataflow jobs run my-job-name \
>   --additional-experiments enable_prime \
>   --disable-public-ips \
>   --gcs-location gs://bucket/path/to/template \
>   --num-workers 1  \
>   --max-workers 16 \
>   --parameters=resourceHints=min_ram=8GiB,other_pipeline_options=true \
>   --project my-project \
>   --region us-central1 \
>   --service-account-email my-service-account@my-project.iam.gserviceaccount.com \
>   --staging-location gs://bucket/path/to/staging \
>   --subnetwork https://www.googleapis.com/compute/v1/projects/my-project/regions/us-central1/subnetworks/my-subnet
> {code}
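> (For reference, the {{resourceHints}} parameter above maps to the Beam SDK's pipeline-level resource hints. A rough in-code equivalent is sketched below; it assumes the {{ResourceHintsOptions}} interface from {{org.apache.beam.sdk.transforms.resourcehints}}, so the exact option name is worth verifying against the SDK version in use.)
> {code}
> import org.apache.beam.sdk.Pipeline;
> import org.apache.beam.sdk.options.PipelineOptionsFactory;
> import org.apache.beam.sdk.transforms.resourcehints.ResourceHintsOptions;
> 
> public class PipelineLevelHintSketch {
>   public static void main(String[] args) {
>     // Illustrative only: parse the same hint the template job receives
>     // via --parameters=resourceHints=min_ram=8GiB.
>     ResourceHintsOptions options =
>         PipelineOptionsFactory.fromArgs("--resourceHints=min_ram=8GiB")
>             .as(ResourceHintsOptions.class);
>     Pipeline pipeline = Pipeline.create(options);
>     // ... build the Pub/Sub -> GCS pipeline as before, then run it ...
>     pipeline.run();
>   }
> }
> {code}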
> In an attempt to use Dataflow Prime's [Right Fitting|https://cloud.google.com/dataflow/docs/guides/right-fitting] capability, I changed the pipeline code to include a resource hint on the FileIO transform:
> {code}
> import org.apache.beam.sdk.io.Compression;
> import org.apache.beam.sdk.io.FileIO;
> import org.apache.beam.sdk.io.WriteFilesResult;
> import org.apache.beam.sdk.transforms.PTransform;
> import org.apache.beam.sdk.transforms.resourcehints.ResourceHints;
> import org.apache.beam.sdk.values.PCollection;
> 
> class WriteGcsFileTransform
>     extends PTransform<PCollection<Input>, WriteFilesResult<Destination>> {
>   private static final long serialVersionUID = 1L;
>   @Override
>   public WriteFilesResult<Destination> expand(PCollection<Input> input) {
>     return input.apply(
>         FileIO.<Destination, Input>writeDynamic()
>             .by(myDynamicDestinationFunction)
>             .withDestinationCoder(Destination.coder())
>             .withNumShards(8)
>             .withNaming(myDestinationFileNamingFunction)
>             .withTempDirectory("gs://bucket/path/to/temp")
>             .withCompression(Compression.GZIP)
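>             // Step-level resource hint added for Dataflow Prime right fitting.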
>             .setResourceHints(ResourceHints.create().withMinRam("32GiB"))
>         );
>   }
> }
> {code}
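> (Since {{setResourceHints}} is defined on {{PTransform}}, the same hint can equivalently be attached at the call site rather than inside {{expand}}; a sketch of that variant, which I'd expect to behave identically, assuming {{input}} is the {{PCollection<Input>}} from the surrounding pipeline:)
> {code}
> // Variant sketch: set the step-level hint on the composite transform
> // where it is applied, using the same PTransform#setResourceHints API.
> WriteGcsFileTransform write = new WriteGcsFileTransform();
> write.setResourceHints(ResourceHints.create().withMinRam("32GiB"));
> WriteFilesResult<Destination> result = input.apply("WriteToGcs", write);
> {code}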
> Attempting to run jobs from a template built from the new code results in a continuous crash loop; the job never runs successfully. The lone repeated error log entry is:
> {code}
> {
>   "insertId": "s=97e1ecd30e0243609d555685318325b4;i=4e1;b=6c7f5d65f3994eada5f20672dab1daf1;m=912f16c;t=5d024689cb030;x=b36751718b3d80c1",
>   "jsonPayload": {
>     "line": "pod_workers.go:191",
>     "message": "Error syncing pod 4cf7cbf98df4b5e2d054abce7da1262b (\"df-df-hvm-my-job-name-11061310-qn51-harness-jb9f_default(4cf7c6bf982df4b5eb2d054abce7da12)\"), skipping: failed to \"StartContainer\" for \"artifact\" with CrashLoopBackOff: \"back-off 40s restarting failed container=artifact pod=df-df-hvm-my-job-name-11061310-qn51-harness-jb9f_default(4cf7c6bf982df4b5eb2d054abce7da12)\"",
>     "thread": "807"
>   },
>   "resource": {
>     "type": "dataflow_step",
>     "labels": {
>       "project_id": "my-project",
>       "region": "us-central1",
>       "step_id": "",
>       "job_id": "2021-11-06_12_10_27-510057810808146686",
>       "job_name": "my-job-name"
>     }
>   },
>   "timestamp": "2021-11-06T20:14:36.052491Z",
>   "severity": "ERROR",
>   "labels": {
>     "compute.googleapis.com/resource_type": "instance",
>     "dataflow.googleapis.com/log_type": "system",
>     "compute.googleapis.com/resource_id": "4695846446965678007",
>     "dataflow.googleapis.com/job_name": "my-job-name",
>     "dataflow.googleapis.com/job_id": "2021-11-06_12_10_27-510057810808146686",
>     "dataflow.googleapis.com/region": "us-central1",
>     "dataflow.googleapis.com/service_option": "prime",
>     "compute.googleapis.com/resource_name": "df-hvm-my-job-name-11061310-qn51-harness-jb9f"
>   },
>   "logName": "projects/my-project/logs/dataflow.googleapis.com%2Fkubelet",
>   "receiveTimestamp": "2021-11-06T20:14:46.471285909Z"
> }
> {code}
> If the pipeline-level resource hint and the step-level resource hint are both set to 8GiB, the pipeline fails with the same repeated error.


