Posted to issues@beam.apache.org by "Patrick Linnane (Jira)" <ji...@apache.org> on 2021/04/06 10:34:00 UTC

[jira] [Created] (BEAM-12101) Dataflow Jobs keep failing with FileNotFoundError: [Errno 2] Not found: gs://tmp.../beamapp..../tmp-27400e24c0c31bc1-00000-of-00001.avro

Patrick Linnane created BEAM-12101:
--------------------------------------

             Summary: Dataflow Jobs keep failing with FileNotFoundError: [Errno 2] Not found: gs://tmp.../beamapp..../tmp-27400e24c0c31bc1-00000-of-00001.avro
                 Key: BEAM-12101
                 URL: https://issues.apache.org/jira/browse/BEAM-12101
             Project: Beam
          Issue Type: Bug
          Components: io-py-avro
    Affects Versions: 2.28.0
         Environment: Google Cloud Platform.
Job launched locally from WSL (Ubuntu 20.0).
Python version 3.8.5.
            Reporter: Patrick Linnane
             Fix For: Not applicable


I am processing up to 1,000 files (.......xml.gz).
When I run a sample of 128, 256, or 512 files it works, but not always.
I have used between 8 and 512 workers. It seems that any time the job runs for longer than 30 minutes, it fails with a FileNotFoundError related to fastavro.
{code:python}
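# Excerpt from the job script. p1, names, no_of_files, no_of_jobs, i, table_ref and the
# DoFn classes (ReadGCS, ParseXML, GetMedline, JsonBuilder) are defined elsewhere.
import apache_beam as beam
from apache_beam.io.gcp.bigquery_tools import RetryStrategy

lines = (
    p1
    # Each job takes the i-th of no_of_jobs contiguous slices of the file list.
    | "Get name" >> beam.Create(names[(no_of_files * (i - 1)) // no_of_jobs: (no_of_files * i) // no_of_jobs])
    | "Read from cloud" >> beam.ParDo(ReadGCS())
    | "Parse into JSON" >> beam.ParDo(ParseXML())
    | "Get Medline" >> beam.ParDo(GetMedline())
    | "Build Json" >> beam.ParDo(JsonBuilder())
    | "Write elements" >> beam.io.WriteToBigQuery(
        table=table_ref,
        create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
        write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
        schema="SCHEMA_AUTODETECT",
        insert_retry_strategy=RetryStrategy.RETRY_ALWAYS,
        ignore_insert_ids=True,
        validate=False,
    )
)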
{code}
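
For reference, the slice in the "Get name" step splits the file list into no_of_jobs contiguous batches. The sketch below isolates that arithmetic so the batch sizes can be checked without running a pipeline. It assumes that i runs from 1 to no_of_jobs and that no_of_files equals len(names); the helper name batch_for_job and the file names are made up for illustration and are not part of the job script.

```python
def batch_for_job(names, i, no_of_jobs):
    """Return the i-th of no_of_jobs contiguous batches of names.

    Assumes i is 1-based and that no_of_files == len(names).
    """
    no_of_files = len(names)
    start = (no_of_files * (i - 1)) // no_of_jobs
    end = (no_of_files * i) // no_of_jobs
    return names[start:end]


# Hypothetical file names, used only to exercise the slicing.
names = [f"file{n:04d}.xml.gz" for n in range(10)]
no_of_jobs = 3

batches = [batch_for_job(names, i, no_of_jobs) for i in range(1, no_of_jobs + 1)]
for i, batch in enumerate(batches, start=1):
    print(i, len(batch), batch[0], batch[-1])
```

Because each batch ends exactly where the next begins, every file lands in exactly one job, and batch sizes differ by at most one when the file count is not divisible by no_of_jobs.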



