You are viewing a plain text version of this content. The canonical link for it is here.

Posted to github@beam.apache.org by "Domlenart (via GitHub)" <gi...@apache.org> on 2023/09/28 10:04:39 UTC

[GitHub] [beam] Domlenart opened a new issue, #28715: [Bug]: Large dataflow jobs get stuck

Domlenart opened a new issue, #28715:
URL: https://github.com/apache/beam/issues/28715

   ### What happened?
   
   Hello,
   
   we are running Dataflow Python jobs using Beam 2.49.0. We are starting those jobs from a notebook using the functionality described [here](https://cloud.google.com/dataflow/docs/guides/interactive-pipeline-development#launch-jobs-from-pipeline). Btw, this example crashes on beam 2.50.0 notebook kernel, I reported this problem to our Google support, let me know if this is something of interest and I will report a separate issue here. 
   
   Problem description:
   
   We have a very simple pipeline which reads data using ReadFromBigQuery, does two beam.Map operations to clean and transform the data to `google.cloud.bigtable.row.DirectRow` and then WriteToBigTable is used to write the data. 
   
   We are testing the performance of BigTable HDD vs SDD based instances, so we wanted to run jobs which insert 10kk and 100kk rows.
   
   Unfortunately, the 10kk job that was writing to HDD instance got stuck after writing 9,999,567 rows. 
   ![image](https://github.com/apache/beam/assets/13343352/1fe805e2-0189-4d47-aca8-e25f770069cb)
   
   
   ![image](https://github.com/apache/beam/assets/13343352/e4cb81d8-852b-4303-9c1e-b12a5f6d2dfa)
   As you can see in the screenshot, the job scaled to about 500 workers, wrote most of the records in ~20min and then it scaled down to 2 workers and no progress was made for ~18h. I cancelled the job manually at that point. 
   
   After rerunning, the job has run to completion in 20 minutes. 
   
   Today, I've started two more jobs, each meant to write 100kk rows to BigTable (one to HDD and the other to SSD based instance). Both got stuck at near completion. Here's some details about one of those jobs:
   ![image](https://github.com/apache/beam/assets/13343352/d6052c92-6a44-416e-9a47-b51c5a70a1c6)
   ![image](https://github.com/apache/beam/assets/13343352/3056715a-b7ec-4ef8-bbab-4c4a66759d1c)
   
   One thing I noticed in all of those jobs is that "stragglers" are detected. 
   ![image](https://github.com/apache/beam/assets/13343352/50112521-7ba1-4f3f-8cce-c5d6e560cd31)
   
   However a reason why they are straggling is undermined:
   
   ![image](https://github.com/apache/beam/assets/13343352/03fd6741-c42a-4f2b-ba42-954d03d49188)
   
   Repro code:
   
   ``` python
   import apache_beam as beam
   from apache_beam.runners.interactive.interactive_runner import InteractiveRunner
   import apache_beam.runners.interactive.interactive_beam as ib
   
   from apache_beam.options import pipeline_options
   from apache_beam.options.pipeline_options import GoogleCloudOptions
   import google.auth
   
   from apache_beam.io.gcp.bigtableio import WriteToBigTable
   from google.cloud.bigtable import row
   import datetime
   import string
   
   from google.cloud import bigtable
   from google.cloud.bigtable import column_family
   from google.cloud.bigtable import row_filters
   from google.cloud.bigtable import column_family
   import logging
   
   from typing import Dict, Any, Tuple, List
   from apache_beam.runners import DataflowRunner
   
   ef to_bt_row(beam_row: Tuple[str, Dict[str, Any]]) -> row.DirectRow:
       import datetime
       """
       Creates BigTable row from standard dataflow row with key mapping to a dict.
       The key is used as a BigTable row key and the dict keys are used as BigTable column names.
       The dict values are used as the column values.
       
       To keep it simple:
       - all columns are assigned to a column family called default
       - the cell timestamp is set to current time
       """
       from google.cloud.bigtable import row as row_
       (key, values) = beam_row
       bt_row = row_.DirectRow(row_key=key)
       for k, v in values.items():
           bt_row.set_cell(
               "default",
               k.encode(),
               str(v).encode(),
               datetime.datetime.now()
           )
       return bt_row
   
   def set_device_id_as_key(row: Dict[str, Any]) -> Tuple[str, Dict[str, Any]]:
       """
       Given dict, convert it to two-element tuple. 
       The first element in the tuple is the original dicts value under "device_id" key.
       The second tuple element is the original dict without the "device_id" key. 
       """
       k = row.pop("device_id")
       return k, row
   
   def insert_data(n: int, source_bq_table: str, instance: str, destination_table:str, jobname="test_job"):
       options = pipeline_options.PipelineOptions(
           flags={},
           job_name=jobname
       )
       _, options.view_as(GoogleCloudOptions).project = google.auth.default()
       options.view_as(GoogleCloudOptions).region = 'us-east1'
       dataflow_gcs_location = 'gs://redacted-gcs-bucket/dataflow'
       options.view_as(GoogleCloudOptions).staging_location = '%s/staging' % dataflow_gcs_location
       options.view_as(GoogleCloudOptions).temp_location = '%s/temp' % dataflow_gcs_location
   
       p = beam.Pipeline(InteractiveRunner())
   
       res = (
           p | 'QueryTable' >> beam.io.ReadFromBigQuery(
               query=f"""
               SELECT* FROM `redacted.redacted.{source_bq_table}` 
               WHERE TIMESTAMP_TRUNC(date_time, DAY) = TIMESTAMP("2023-09-15")
               limit {n}
               """,
               use_standard_sql=True,
               project="redacted",
               use_json_exports=True,
               gcs_location="gs://redactedbucket/bq_reads"
           )
           | "set device id" >> beam.Map(set_device_id_as_key)
           | "create bt rows" >> beam.Map(to_bt_row)
           | "write out" >> WriteToBigTable(
               project_id="another-project",
               instance_id=instance,
               table_id=destination_table
           )
       )
   
       DataflowRunner().run_pipeline(p, options=options)
   
   insert_data(100_000_000, "bq_table_with_100kk_rows", "xyz-ssd", "some_table", "test_100kk_ssd")
   ```
   
   Let me know if you need any further details, I'd be very glad to help!
   
   
   
   ### Issue Priority
   
   Priority: 1 (data loss / total loss of function)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam YAML
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [X] Component: Google Cloud Dataflow Runner


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Domlenart commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "Domlenart (via GitHub)" <gi...@apache.org>.

Domlenart commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1742012199

   According to support, the Dataflow team was contacted about this. Can you please confirm @liferoad? Also, if that's the case can you please provide a bit more detail about next steps and if there will be a bugfix for this problem in a future release? 
   While 2.45 works, it does not support Python 3.11, so once the bug is fixed we'd love to go back to 2.50+. Thanks!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1740818919

   Please ask them to route this to our Dataflow team.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.

Abacn closed issue #28715: [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock
URL: https://github.com/apache/beam/issues/28715


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1740836768

   Thanks a lot. If you hear anything back, please let us know. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock [beam]

Posted by "chamikaramj (via GitHub)" <gi...@apache.org>.

chamikaramj commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1939871894

   @Abacn seems like we need to re-enable the BigTable test here: https://github.com/apache/beam/blob/fa3249206d04295e677520b6df691e93b9b47cba/sdks/python/apache_beam/examples/cookbook/bigtableio_it_test.py#L180


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1742085554

   Just saw the ticket. I will ask our engineers to take a closer look. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Large dataflow jobs get stuck [beam]

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1745248441

   But from the screenshots, the jobs indeed already used `Runner V2`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Domlenart commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "Domlenart (via GitHub)" <gi...@apache.org>.

Domlenart commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1738943199

   I have managed to run the stuck jobs on beam 2.45. I retried 4 jobs with the 100kk rows write so far, and they all ran to completion. 
   
   I will do some more testing on 2.45 and report if we manage to observe this problem on that version as well. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.

Abacn commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1889425126

   > Any update on this ?
   
   It is resolved. Upgrade to Beam 2.53.0 or pin google-cloud-bigtable==2.22.0 for older version between 2.49.0 or 2.52.0 should resolve the issue


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1739987723

   Please let us know if you filed the ticket. Thanks.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Large dataflow jobs get stuck [beam]

Posted by "PanJ (via GitHub)" <gi...@apache.org>.

PanJ commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1766896214

   I have a similar issue posted [here](https://stackoverflow.com/questions/77293191/apache-beam-pythons-writetobigtable-sometimes-causes-a-step-to-keep-running-inf)
   
   In my case, even a small `WriteToBigTable` job could get stuck (but at a very low chance). Not sure if my logs helps with the diagnosis
   
   ```
   Unable to perform SDK-split for work-id: 5193980908353266575 due to error: INTERNAL: Empty split returned. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/fnapi_operators.cc" line: 2738 } } }']
   === Source Location Trace: ===
   dist_proc/dax/internal/status_utils.cc:236
    And could not Checkpoint reader due to error: OUT_OF_RANGE: Cannot checkpoint when range tracker is finished. [type.googleapis.com/util.MessageSetPayload='[dist_proc.dax.internal.TrailProto] { trail_point { source_file_loc { filepath: "dist_proc/dax/workflow/worker/operator.cc" line: 340 } } }']
   === Source Location Trace: ===
   dist_proc/dax/io/dax_reader_driver.cc:253
   dist_proc/dax/workflow/worker/operator.cc:340
   ```
   
   Also, the issue still occurs in `2.51.0` version


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Large dataflow jobs get stuck [beam]

Posted by "ammppp (via GitHub)" <gi...@apache.org>.

ammppp commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1745243883

   I believe the reason the pipeline fails on 2.50 is because of this:  https://github.com/apache/beam/issues/28399
   
   Effectively, when a pipeline is created/run like the example below, it ends up trying to use the Dataflow Runner V1 which is no longer allowed from Beam SDK 2.50+:
    
   `p = beam.Pipeline(InteractiveRunner()) 
   DataflowRunner().run_pipeline(p, options=options)`
   
   A workaround (until 2.51 is released) is to manually specify the "--experiments=use_runner_v2" in the pipeline options.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Python WriteToBigtable get stuck for large jobs due to client dead lock [beam]

Posted by "ee07dazn (via GitHub)" <gi...@apache.org>.

ee07dazn commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1889327147

   Any update on this ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] tvalentyn commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.

tvalentyn commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1739843005

   I think Dataflow support is a proper channel for this issue, we can open a follow up issue Beam SDK improvement once that's clear.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Large dataflow jobs get stuck [beam]

Posted by "Abacn (via GitHub)" <gi...@apache.org>.

Abacn commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1745366608

   > I believe the reason the pipeline fails on 2.50 is because of this: #28399
   > 
   > Effectively, when a pipeline is created/run like the example below, it ends up trying to use the Dataflow Runner V1 which is no longer allowed from Beam SDK 2.50+:
   > 
   > p = beam.Pipeline(InteractiveRunner()) DataflowRunner().run_pipeline(p, options=options)
   > 
   > A workaround (until 2.51 is released) is to manually specify the "--experiments=use_runner_v2" in the pipeline options.
   
   This is unexpected. In Beam 2.49.0 Python, if it does not explicit "disable_runner_v2" experiment, it should default to runner v2, and need `disable_runner_v2` to run on Dataflow legacy runner; in 2.50.0 Python, it should always on runner v2 without the need of specifying experiment.
   
   Let us taking a closer look and please send a customer ticket to Dataflow so we can take a look of your jobId


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1742087719

   @mutianf FYI.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] tvalentyn closed issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "tvalentyn (via GitHub)" <gi...@apache.org>.

tvalentyn closed issue #28715: [Bug]: Large dataflow jobs get stuck
URL: https://github.com/apache/beam/issues/28715


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1739074439

   @Abacn @ahmedabu98 FYI.
   @Domlenart Thanks a lot to report the issue with the detailed repo code. Please ask the cloud support team to reach out the Beam IO team at Google and we can do more debugging from there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Domlenart commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "Domlenart (via GitHub)" <gi...@apache.org>.

Domlenart commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1740442954

   > Please let us know if you filed the cloud support ticket. Thanks.
   
   Yes, right after you've made that request. Unfortunately, support is asking me to try different beam versions (2.46) or downgrade Protobuf. Which I have no interest in doing since I already see in our testing that the bug does not exist on 2.45. Looks like support is trying to blame this on the memory leak issue reported in other tickets. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] Domlenart commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "Domlenart (via GitHub)" <gi...@apache.org>.

Domlenart commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1740824431

   Done. I've repeated the request in the ticket to route it to the proper team internally. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [beam] liferoad commented on issue #28715: [Bug]: Large dataflow jobs get stuck

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1742087314

   This might be related to https://github.com/apache/beam/issues/28562 (Java SDK) based on the symptom. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

Re: [I] [Bug]: Large dataflow jobs get stuck [beam]

Posted by "liferoad (via GitHub)" <gi...@apache.org>.

liferoad commented on issue #28715:
URL: https://github.com/apache/beam/issues/28715#issuecomment-1767109675

   yes, the issue is actually in the bigtable client library. For now, please use  Beam 2.45.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org