Posted to issues@beam.apache.org by "Barry Hart (JIRA)" <ji...@apache.org> on 2019/03/07 03:57:00 UTC

[jira] [Comment Edited] (BEAM-6765) Beam 2.10.0 for Python requires pyarrow 0.11.1, which is not installable in Google Cloud DataFlow

    [ https://issues.apache.org/jira/browse/BEAM-6765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16786351#comment-16786351 ] 

Barry Hart edited comment on BEAM-6765 at 3/7/19 3:56 AM:
----------------------------------------------------------

We have a single {{requirements.txt}} file used to set up:
* Our development environment
* The Docker (Kubernetes) image that launches the job

But when submitting the job to DataFlow, we create an altered requirements file, {{prod_requirements.txt}}, which is a copy of the original with {{apache-beam}} and {{pyarrow}} removed. The process looks roughly like this:

{code}
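# Remove apache-beam and pyarrow; the DataFlow workers already have these preinstalled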
sed '/^apache-beam/d; /^pyarrow/d' requirements.txt > prod_requirements.txt

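# Submit the Beam job to DataFlow using the trimmed requirements file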
python script/beam_run_model.py \
  --project ${gcp_project_name} \
  --runner DataflowRunner \
  --requirements_file prod_requirements.txt \
  --extra_package dist/beam_job-1.0.tar.gz \
  --region us-central1 \
  --worker_machine_type n1-standard-2
{code}

I find this a pretty clunky approach. Ideally, an application should only have _one_ requirements file. This approach works because DataFlow worker instances come with a number of [preinstalled Python libraries|https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies], so there's no need to list those libraries in the requirements file; they're already installed on the workers.

If our job uses one of those libraries, we list it in {{requirements.txt}}, and we make sure the version listed there matches the DataFlow-preinstalled version. This is another source of complexity, because to my knowledge, this list of libraries and versions is not available in any machine-friendly form. When I have time, I plan to write a little "screen scraper" script to create a partial requirements file from the documentation page linked above. With that and the {{sed}} command listed above, I think I can come up with a fairly automated way to manage requirements for a Beam job. This may seem like overkill, but with a new Beam release every two months or so, this process needs to be pretty easy. I think Google only supports running DataFlow jobs on old Beam releases for a year or two, so it's not wise (or even possible) to avoid updating.
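
For what it's worth, here is a minimal sketch of the scraper I have in mind. It assumes the docs page renders each SDK's dependency list as plain HTML tables of package/version rows (I haven't verified the markup), and the {{requests}}/{{beautifulsoup4}} dependencies and the output filename are just illustrative choices:

{code}
#!/usr/bin/env python
"""Sketch: scrape the DataFlow worker-dependency docs page into a partial
requirements file. The table layout assumed below is a guess; adjust the
parsing once the real page structure is known."""
import requests
from bs4 import BeautifulSoup

DEPS_URL = "https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies"


def scrape_preinstalled_versions(url=DEPS_URL):
    """Return a {package: version} dict scraped from the docs page."""
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    pins = {}
    for table in soup.find_all("table"):
        for row in table.find_all("tr"):
            cells = [cell.get_text(strip=True) for cell in row.find_all("td")]
            # Expect data rows like ["numpy", "1.14.5"]; skip header/odd rows.
            if len(cells) >= 2 and cells[1] and cells[1][0].isdigit():
                pins[cells[0]] = cells[1]
    return pins


if __name__ == "__main__":
    # Usage (hypothetical): python scrape_dataflow_deps.py > dataflow_pins.txt
    for package, version in sorted(scrape_preinstalled_versions().items()):
        print("{}=={}".format(package, version))
{code}

The idea would then be to merge the generated pins into our own {{requirements.txt}} so the pinned versions stay in sync with each DataFlow release.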

I was considering creating a product enhancement request with Google about this, but I haven't done so yet.


was (Author: barrywhart):
We have a {{requirements.txt}} used for setting up our development environment and the Docker (Kubernetes) image which launches the job.

But when submitting the job to DataFlow, we create an altered requirements file, {{prod_requirements.txt}}, without {{apache-beam}} and {{pyarrow}}. It looks roughly like the following:

{code}
sed '/^apache-beam/d; /^pyarrow/d' requirements.txt > prod_requirements.txt

GOOGLE_APPLICATION_CREDENTIALS=$1 python script/beam_run_model.py \
  --project ${gcp_project_name} \
  --runner DataflowRunner \
  --requirements_file prod_requirements.txt \
  --extra_package dist/beam_job-1.0.tar.gz \
  --region us-central1 \
  --worker_machine_type n1-standard-2
{code}

I find this a pretty clunky approach. Ideally, an application should only have _one_ requirements file. The reason this approach works is that when running a job, DataFlow worker instances have a number of [preinstalled Python libraries|https://cloud.google.com/dataflow/docs/concepts/sdk-worker-dependencies], and it's unnecessary to include those libraries in the requirements file since they're already installed. If our job uses one of those libraries, we try to make sure our development environment uses precisely the same version as the preinstalled DataFlow version. This is another source of complexity, because to my knowledge, this list of libraries is not available in any machine-friendly form. When I have time, I plan to write a little "screen scraper" script to create a partial requirements file from the documentation page linked above. With that and the {{sed}} command listed above, I think I can come up with a fairly automated way to manage requirements for a Beam job. This may seem like overkill, but with a new Beam release every two months or so, this process needs to be pretty easy. I think old releases are only supported for a year or two, so it's not wise (or even possible) to avoid updating.

I was considering creating a product enhancement request with Google about this, but I haven't done so yet.

> Beam 2.10.0 for Python requires pyarrow 0.11.1, which is not installable in Google Cloud DataFlow
> -------------------------------------------------------------------------------------------------
>
>                 Key: BEAM-6765
>                 URL: https://issues.apache.org/jira/browse/BEAM-6765
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.10.0
>            Reporter: Barry Hart
>            Priority: Major
>             Fix For: 2.10.0
>
>
> When trying to run a Beam 2.10.0 job in Google Cloud DataFlow, I get the following error:
> {noformat}
> Collecting pyarrow==0.11.1 (from -r requirements.txt (line 51))
> Could not find a version that satisfies the requirement pyarrow==0.11.1 (from -r requirements.txt (line 51)) (from versions: 0.9.0, 0.10.0, 0.11.0, 0.12.1)
> No matching distribution found for pyarrow==0.11.1 (from -r requirements.txt (line 51))
> {noformat}
> This version exists on PyPI only as a binary wheel, and it cannot be installed in Google Cloud DataFlow because DataFlow only allows installing source packages, not binary packages.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)