Posted to user@beam.apache.org by Patrick McQuighan via user <us...@beam.apache.org> on 2023/01/11 01:41:43 UTC

DataFlow Template error - SDK not reporting number of elements processed

Hi,

I recently started encountering a strange error where a Dataflow job
launched from a template never completes, although the same job completes
when launched directly. The template has been in use since Dec 14 without
issue, but recreating the template today (or at any point in the past
week) and executing it results in one stage of the job sitting at 100%
complete for hours without ever finishing.

When running the job directly (i.e. not via a template) today, the Logs
Explorer shows a confusing message, but the job does complete:
Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
processing element 535 yet only 535 elements have been sent

When trying to run via template, the following three errors show up:

Element processed sanity check disabled due to SDK not reporting number of
elements processed.

Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
last):
  File
"/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
line 667, in process_bundle_progress
    processor = self.bundle_processor_cache.lookup(request.instruction_id)
  File
"/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
line 468, in lookup
    raise RuntimeError(
RuntimeError: Bundle processing associated with
process_bundle-7395200449888031466-19 has failed. Check prior failing
response for details.
 [
type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
{ trail_point { source_file_loc { filepath:
"dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
=== Source Location Trace: ===
dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800

SDK failed progress reporting 6 times (limit: 5), no longer holding back
progress to last SDK reported progress.

None of these error messages show up when running the template created on
Dec 14, so I'm unsure whether some setting or default behavior has changed,
or what's going on. Any help or pointers for debugging would be much
appreciated.

Thanks,
Patrick

Re: DataFlow Template error - SDK not reporting number of elements processed

Posted by Patrick McQuighan via user <us...@beam.apache.org>.
Hi,

I think I finally managed to track down the difference - the Dataflow job
runs correctly when it has the pipeline option tempLocation set (in
addition to temp_location).  I have had trouble getting that field set via
the gcloud CLI, but using the Python SDK
<https://cloud.google.com/dataflow/docs/reference/rpc/google.dataflow.v1beta3#google.dataflow.v1beta3.LaunchTemplateParameters>
I can set it with "environment": {"tempLocation": temp_location}, and the
job then launches and executes as expected.  I'm not really sure what the
difference is.
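
For reference, a minimal sketch of that launch call using the
google-api-python-client (the project, region, template path, and job name
here are placeholders, not values from the actual job):

    from googleapiclient.discovery import build

    # Build a client for the Dataflow REST API (v1b3).
    dataflow = build("dataflow", "v1b3")

    response = dataflow.projects().locations().templates().launch(
        projectId="my-project",                           # placeholder
        location="us-central1",                           # placeholder
        gcsPath="gs://my-bucket/templates/my-template",   # placeholder
        body={
            "jobName": "my-templated-job",
            "parameters": {},  # template parameters, if any
            # Setting tempLocation in the runtime environment (in
            # addition to the pipeline's temp_location) is the fix
            # described above.
            "environment": {"tempLocation": "gs://my-bucket/tmp"},
        },
    ).execute()

    # The response contains the launched Job resource.
    print(response["job"]["id"])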

Thanks,
Patrick

On Wed, Jan 11, 2023 at 8:45 AM Patrick McQuighan <pm...@camus.energy>
wrote:

> Hi Bruno,
> Thanks for the response.  The SDK version and all dependencies should be
> identical - this issue occurs using code from the exact same commit in
> git, and the dependencies are frozen.  I should mention this is using the
> Python SDK version 2.39.0.
>
> Diffing the templates only appears to show expected differences: e.g. the
> tempStoragePrefix points to a different bucket, and serialized_fns and
> windowing_strategy differ in a couple of locations, but only in the name
> of a temp directory (e.g. tmpv396o090).  Similarly, I cannot see any
> differences in the Pipeline options.  So I've been scratching my head
> trying to figure out what's going on here :/.
>
> I'll create a ticket with Dataflow support and see if there's something
> with the Dataflow runner that might be causing this issue!
>
> -Patrick
>
>
>
> On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <bv...@google.com> wrote:
>
>> Hi Patrick,
>>
>> I have a few questions that might help troubleshoot this:
>>
>> Did you use the same SDK? Have you updated Beam or any other dependencies?
>> Are there any other error logs (prior to the trace above) that could help
>> understand it?
>> Do you still have the previous template so you can compare the contents?
>> (they are JSON, so formatting and diffing may be sufficient here.)
>> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
>> possible environment/parameter changes.
>>
>> This might be related to a specific runner (Dataflow) rather than the
>> SDK, so if the above doesn't help, a good approach may be contacting
>> Dataflow support and providing specific job IDs so they can take a
>> closer look.
>>
>> Best,
>> Bruno
>>
>>
>>
>> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
>> user@beam.apache.org> wrote:
>>
>>>
>>> Hi,
>>>
>>> I recently started encountering a strange error where a Dataflow job
>>> launched from a template never completes, although the same job
>>> completes when launched directly. The template has been in use since
>>> Dec 14 without issue, but recreating the template today (or at any
>>> point in the past week) and executing it results in one stage of the
>>> job sitting at 100% complete for hours without ever finishing.
>>>
>>> When running the job directly (i.e. not via a template) today, the Logs
>>> Explorer shows a confusing message, but the job does complete:
>>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>>> processing element 535 yet only 535 elements have been sent
>>>
>>> When trying to run via template, the following three errors show up:
>>>
>>> Element processed sanity check disabled due to SDK not reporting number
>>> of elements processed.
>>>
>>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
>>> last):
>>>   File
>>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>>> line 667, in process_bundle_progress
>>>     processor =
>>> self.bundle_processor_cache.lookup(request.instruction_id)
>>>   File
>>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>>> line 468, in lookup
>>>     raise RuntimeError(
>>> RuntimeError: Bundle processing associated with
>>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>>> response for details.
>>>  [
>>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>>> { trail_point { source_file_loc { filepath:
>>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>>> === Source Location Trace: ===
>>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>>
>>> SDK failed progress reporting 6 times (limit: 5), no longer holding back
>>> progress to last SDK reported progress.
>>>
>>> None of these error messages show up when running the template created
>>> on Dec 14, so I'm unsure whether some setting or default behavior has
>>> changed, or what's going on. Any help or pointers for debugging would
>>> be much appreciated.
>>>
>>> Thanks,
>>> Patrick
>>>
>>

Re: DataFlow Template error - SDK not reporting number of elements processed

Posted by Patrick McQuighan via user <us...@beam.apache.org>.
Hi Bruno,
Thanks for the response.  The SDK version and all dependencies should be
identical - this issue occurs using code from the exact same commit in git,
and the dependencies are frozen.  I should mention this is using the Python
SDK version 2.39.0.

Diffing the templates only appears to show expected differences: e.g. the
tempStoragePrefix points to a different bucket, and serialized_fns and
windowing_strategy differ in a couple of locations, but only in the name of
a temp directory (e.g. tmpv396o090).  Similarly, I cannot see any
differences in the Pipeline options.  So I've been scratching my head
trying to figure out what's going on here :/.
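
For anyone comparing templates the same way, a minimal sketch of the diff
(the file names are placeholders, and the tmp-name pattern is just a guess
based on the example above) - pretty-printing with sorted keys and masking
the ephemeral temp directory names keeps them from drowning out real
differences:

    import difflib
    import json
    import re

    def normalized(path):
        # Pretty-print with sorted keys so formatting differences vanish,
        # then mask ephemeral temp dir names like tmpv396o090.
        with open(path) as f:
            text = json.dumps(json.load(f), indent=2, sort_keys=True)
        return re.sub(r"tmp[0-9a-z]{8}", "tmpXXXXXXXX", text)

    old = normalized("template_dec14.json").splitlines()
    new = normalized("template_today.json").splitlines()
    print("\n".join(difflib.unified_diff(old, new, "dec14", "today",
                                         lineterm="")))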

I'll create a ticket with Dataflow support and see if there's something
with the Dataflow runner that might be causing this issue!

-Patrick



On Tue, Jan 10, 2023 at 8:32 PM Bruno Volpato <bv...@google.com> wrote:

> Hi Patrick,
>
> I have a few questions that might help troubleshoot this:
>
> Did you use the same SDK? Have you updated Beam or any other dependencies?
> Are there any other error logs (prior to the trace above) that could help
> understand it?
> Do you still have the previous template so you can compare the contents?
> (they are JSON, so formatting and diffing may be sufficient here.)
> If not, I'd suggest comparing the "Job info" and "Pipeline options" for
> possible environment/parameter changes.
>
> This might be related to a specific runner (Dataflow) rather than the SDK,
> so if the above doesn't help, a good approach may be contacting Dataflow
> support and providing specific job IDs so they can take a closer look.
>
> Best,
> Bruno
>
>
>
> On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
> user@beam.apache.org> wrote:
>
>>
>> Hi,
>>
>> I recently started encountering a strange error where a Dataflow job
>> launched from a template never completes, although the same job completes
>> when launched directly. The template has been in use since Dec 14 without
>> issue, but recreating the template today (or at any point in the past
>> week) and executing it results in one stage of the job sitting at 100%
>> complete for hours without ever finishing.
>>
>> When running the job directly (i.e. not via a template) today, the Logs
>> Explorer shows a confusing message, but the job does complete:
>> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
>> processing element 535 yet only 535 elements have been sent
>>
>> When trying to run via template, the following three errors show up:
>>
>> Element processed sanity check disabled due to SDK not reporting number
>> of elements processed.
>>
>> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
>> last):
>>   File
>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 667, in process_bundle_progress
>>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>>   File
>> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
>> line 468, in lookup
>>     raise RuntimeError(
>> RuntimeError: Bundle processing associated with
>> process_bundle-7395200449888031466-19 has failed. Check prior failing
>> response for details.
>>  [
>> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
>> { trail_point { source_file_loc { filepath:
>> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
>> === Source Location Trace: ===
>> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
>> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>>
>> SDK failed progress reporting 6 times (limit: 5), no longer holding back
>> progress to last SDK reported progress.
>>
>> None of these error messages show up when running the template created
>> on Dec 14, so I'm unsure whether some setting or default behavior has
>> changed, or what's going on. Any help or pointers for debugging would be
>> much appreciated.
>>
>> Thanks,
>> Patrick
>>
>

Re: DataFlow Template error - SDK not reporting number of elements processed

Posted by Bruno Volpato via user <us...@beam.apache.org>.
Hi Patrick,

I have a few questions that might help troubleshoot this:

Did you use the same SDK? Have you updated Beam or any other dependencies?
Are there any other error logs (prior to the trace above) that could help
understand it?
Do you still have the previous template so you can compare the contents?
(they are JSON, so formatting and diffing may be sufficient here.)
If not, I'd suggest comparing the "Job info" and "Pipeline options" for
possible environment/parameter changes.
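
If the older template JSON is gone, the recorded job metadata can also be
pulled via the API for comparison - a minimal sketch with the
google-api-python-client (project, region, and job IDs are placeholders;
JOB_VIEW_ALL should include the pipeline options under
environment.sdkPipelineOptions):

    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")

    def job_environment(project, region, job_id):
        # Fetch the full job description, including its environment.
        job = dataflow.projects().locations().jobs().get(
            projectId=project, location=region, jobId=job_id,
            view="JOB_VIEW_ALL").execute()
        return job.get("environment", {})

    good = job_environment("my-project", "us-central1", "GOOD_JOB_ID")
    bad = job_environment("my-project", "us-central1", "BAD_JOB_ID")
    # Compare however you like, e.g. json.dumps + difflib on each dict.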

This might be related to a specific runner (Dataflow) rather than the SDK,
so if the above doesn't help, a good approach may be contacting Dataflow
support and providing specific job IDs so they can take a closer look.

Best,
Bruno



On Tue, Jan 10, 2023 at 8:42 PM Patrick McQuighan via user <
user@beam.apache.org> wrote:

>
> Hi,
>
> I recently started encountering a strange error where a Dataflow job
> launched from a template never completes, although the same job completes
> when launched directly. The template has been in use since Dec 14 without
> issue, but recreating the template today (or at any point in the past
> week) and executing it results in one stage of the job sitting at 100%
> complete for hours without ever finishing.
>
> When running the job directly (i.e. not via a template) today, the Logs
> Explorer shows a confusing message, but the job does complete:
> Error requesting progress from SDK: OUT_OF_RANGE: SDK claims to be
> processing element 535 yet only 535 elements have been sent
>
> When trying to run via template, the following three errors show up:
>
> Element processed sanity check disabled due to SDK not reporting number of
> elements processed.
>
> Error requesting progress from SDK: UNKNOWN: Traceback (most recent call
> last):
>   File
> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
> line 667, in process_bundle_progress
>     processor = self.bundle_processor_cache.lookup(request.instruction_id)
>   File
> "/usr/local/lib/python3.9/site-packages/apache_beam/runners/worker/sdk_worker.py",
> line 468, in lookup
>     raise RuntimeError(
> RuntimeError: Bundle processing associated with
> process_bundle-7395200449888031466-19 has failed. Check prior failing
> response for details.
>  [
> type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto]
> { trail_point { source_file_loc { filepath:
> "dist_proc/dax/workflow/worker/fnapi_service_impl.cc" line: 800 } } }']
> === Source Location Trace: ===
> dist_proc/dax/workflow/worker/fnapi_sdk_harness.cc:183
> dist_proc/dax/workflow/worker/fnapi_service_impl.cc:800
>
> SDK failed progress reporting 6 times (limit: 5), no longer holding back
> progress to last SDK reported progress.
>
> None of these error messages show up when running the template created on
> Dec 14, so I'm unsure whether some setting or default behavior has
> changed, or what's going on. Any help or pointers for debugging would be
> much appreciated.
>
> Thanks,
> Patrick
>