Posted to dev@beam.apache.org by Matthew Darwin <Ma...@carfinance247.co.uk> on 2019/07/29 15:31:38 UTC

[BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property

Hi All,


This is my first attempt at a change for Beam, for https://issues.apache.org/jira/browse/BEAM-7819.  This parses the message_id when reading from the PubSub protobuf and adds it to the attributes property, as suggested by the existing documentation - https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes
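
To illustrate the idea, here is a minimal sketch (not the actual code in apache_beam/io/gcp/pubsub.py; the class and helper names are placeholders): the parsed Pub/Sub protobuf exposes data, attributes and message_id fields, and the change copies message_id into the attributes dict when building the Beam PubsubMessage.

class PubsubMessage(object):
    """Stand-in for apache_beam.io.gcp.pubsub.PubsubMessage (sketch only)."""

    def __init__(self, data, attributes):
        self.data = data
        self.attributes = attributes


def message_from_proto(parsed_proto):
    """Build a PubsubMessage from a parsed Pub/Sub protobuf message.

    parsed_proto is assumed to expose .data (bytes), .attributes (a mapping)
    and .message_id (str), as the Pub/Sub protobuf does.
    """
    attributes = dict(parsed_proto.attributes)
    # The proposed change: surface the server-assigned message_id through
    # the attributes property, as the documentation suggests.
    attributes['message_id'] = parsed_proto.message_id
    return PubsubMessage(parsed_proto.data, attributes)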


My code is currently checked into my fork:- https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id


Prior to creating a pull request, I just wanted to check that the approach is sensible, and then to go through the pull request process to ensure it goes smoothly.  I do wonder whether, in the long term, despite the documentation indicating this would be part of the attributes property, it might be more sensible to fully mimic the protobuf for the PubSub message.

Kind regards

Matthew


Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property

Posted by Matthew Darwin <Ma...@carfinance247.co.uk>.
So I wasn't paying attention to the error message after trying to upload the Python docker image (a permissions issue), which meant that when I ran the Dataflow job, it couldn't find the docker image. Digging into the Stackdriver logs revealed this; the job was just stuck in a loop waiting for the image. Oops, need to learn to read!

Rectified that; however, I realised I hadn't changed the _from_protobuf function, which is handled differently, so I will take a look at dealing with that.
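
The same message_id handling will be needed on that second path too; roughly along these lines (the function and field names below are placeholders, not the actual Beam internals):

def attributes_with_message_id(pubsub_msg):
    """Return the message attributes with the server-assigned id added.

    pubsub_msg is assumed to expose .attributes (a mapping) and
    .message_id (str), as parsed Pub/Sub message objects do.
    """
    attributes = dict(pubsub_msg.attributes)
    attributes['message_id'] = pubsub_msg.message_id
    return attributes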

-----Original Message-----
From: Ahmet Altay <altay@google.com>
Reply-To: dev@beam.apache.org
To: dev <dev@beam.apache.org>, Mark Liu <markliu@google.com>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Fri, 02 Aug 2019 18:55:57 -0700



On Wed, Jul 31, 2019 at 4:19 AM Matthew Darwin <Ma...@carfinance247.co.uk> wrote:
Hi Ahmet/Udi,

There are a couple of additional tests that failed following my change: apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success and apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success, as the added message_id in the attributes is not expected. I can fix these quite easily, but I'm not clear on what these tests are for, and what the pubsub_matcher is actually supposed to be testing for, so my arbitrary fix to add the expectation of message_id into the attributes property seems odd.


If I remember correctly, the pubsub matcher is a utility of the test framework. Its purpose is to make it easier to write tests that use pubsub as an output and verify the outputs of those tests.

/cc +Mark Liu<ma...@google.com> to correct me on this.
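
As a rough sketch of how it is typically wired into an integration test (the constructor arguments here are assumptions, so check pubsub_matcher.py and pubsub_integration_test.py for the real signatures): the matcher pulls messages from an output subscription and asserts, via hamcrest, that they match the expected ones, which is why an extra message_id attribute now makes the comparison fail.

from hamcrest.core.core.allof import all_of

from apache_beam.io.gcp.tests.pubsub_matcher import PubSubMessageMatcher
from apache_beam.testing.test_pipeline import TestPipeline


def integration_test_args(project, output_subscription, expected_messages):
    # The matcher reads from the output subscription after the job succeeds
    # and compares what it pulls against expected_messages.
    message_matcher = PubSubMessageMatcher(
        project, output_subscription, expected_msg=expected_messages,
        timeout=400)
    pipeline = TestPipeline(is_integration_test=True)
    # Handed to the pipeline under test as an on_success_matcher option.
    return pipeline.get_full_options_as_args(
        on_success_matcher=all_of(message_matcher))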


In terms of testing with the GCP Dataflow runner, I'm struggling a little. I've followed the guidance on the Python Tips page:-

I've built the tarball:-

python setup.py sdist

Then run the following:-

# Build portable worker
./gradlew :runners:google-cloud-dataflow-java:worker:build -x spotlessJava -x rat -x test
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar

# Build portable Python SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest

Then, using my simple test pipeline to insert into a BigQuery table, I run the following:-

python ~/github/Beam-Testing/PythonTestMessageId.py --runner DataflowRunner --project [MY-Project] --input_subscription projects/[MY-Project]/subscriptions/test-apache-beam.subscription --output_table [MY-Project]:test.dataflowrunner --temp_location gs://[MY-Project]/tmp --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar' --debug
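
For reference, a minimal sketch of what a pipeline like PythonTestMessageId.py might look like (the schema and field names here are assumptions, not the actual script): it reads with with_attributes=True so that each element is a PubsubMessage, then writes the payload together with the message_id attribute to BigQuery.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def to_row(msg):
    # msg is an apache_beam.io.gcp.pubsub.PubsubMessage; with the change,
    # its attributes dict should also carry the server-assigned message_id.
    return {
        'payload': msg.data.decode('utf-8'),
        'message_id': msg.attributes.get('message_id'),
    }


def run(argv=None):
    options = PipelineOptions(argv, streaming=True)
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(
             subscription=('projects/MY-PROJECT/subscriptions/'
                           'test-apache-beam.subscription'),
             with_attributes=True)
         | 'ToRow' >> beam.Map(to_row)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
             'MY-PROJECT:test.dataflowrunner',
             schema='payload:STRING,message_id:STRING'))


if __name__ == '__main__':
    run()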



It is hard for us to debug this without knowing what happened with the Dataflow jobs. Another way to verify this could be running one of the existing tests on your PR. You can try [1] by commenting "Run Python Dataflow ValidatesRunner" on your PR.

[1] https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_Python_ValidatesRunner_Dataflow.groovy

The job starts ok, but when I publish messages, nothing is being read from the queue. I'm assuming this might be to do with the changes made, and whilst it works with the DirectRunner this is not the case with the Dataflow runner; but I'm not receiving any errors in the debug stream on my console, only a series of info messages around the worker configuration.

This may well be me doing stuff wrong, so apologies if I'm being thick here! Have I missed some steps in the build?

Regards

Matthew

-----Original Message-----
From: Ahmet Altay <altay@google.com>
Reply-To: dev@beam.apache.org
To: dev <dev@beam.apache.org>, Udi Meiri <ehudm@google.com>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Mon, 29 Jul 2019 09:50:14 -0700

Hi Matthew,

This looks like a reasonable approach. There is a difference in how the direct runner reads from pubsub compared to other runners. As you convert this to a PR, please pay attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <Ma...@carfinance247.co.uk> wrote:

Hi All,


This is my first attempt at a change for Beam, for https://issues.apache.org/jira/browse/BEAM-7819.  This parses the message_id when reading from the PubSub protobuf and adds it to the attributes property, as suggested by the existing documentation - https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes


My code is currently checked into my fork:- https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id


Prior to creating a pull request, I just wanted to check that the approach is sensible, and then to go through the pull request process to ensure it goes smoothly.  I do wonder whether, in the long term, despite the documentation indicating this would be part of the attributes property, it might be more sensible to fully mimic the protobuf for the PubSub message.


I agree that making it similar to the protobuf sounds like a more sensible approach. Take a look at how it is handled in Java. There is also value in consistency across languages. Final decision could be made in the PR.

Ahmet

/cc +Udi Meiri<ma...@google.com>


Kind regards

Matthew


Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property

Posted by Ahmet Altay <al...@google.com>.
On Wed, Jul 31, 2019 at 4:19 AM Matthew Darwin <
Matthew.Darwin@carfinance247.co.uk> wrote:

> Hi Ahmet/Udi,
>
> There are a couple of additional tests that failed following my change;
> apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success
> and
> apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success,
> as the added message_id in the attributes is not expected; I can fix these
> quite easily, but I'm not clear on what these tests are for, and what the
> pubsub_matcher is actually supposed to be testing for, so my arbitrary fix
> to add the expectation of message_id into the attributes property seems odd.
>

If I remember correctly, the pubsub matcher is a utility of the test framework.
Its purpose is to make it easier to write tests that use pubsub as an
output and verify the outputs of those tests.

/cc +Mark Liu <ma...@google.com> to correct me on this.


>
> In terms of testing with gcp dataflow runner, I'm struggling a little.
> I've followed the guidance on the Python Tips page:-
>
> I've built the tarball:-
>
> python setup.py sdist
>
> Then run the following:-
>
> # Build portable worker
> ./gradlew :runners:google-cloud-dataflow-java:worker:build -x
> spotlessJava -x rat -x test
> ./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar
>
> # Build portable Python SDK harness and publish it to GCP
> ./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p
> sdks/python/container docker
> gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest
>
> Then, using my simple test pipeline to insert into a BigQuery table, I run
> the following:-
>
> python ~/github/Beam-Testing/PythonTestMessageId.py --runner
> DataflowRunner --project [MY-Project] --input_subscription
> projects/[MY-Project]/subscriptions/test-apache-beam.subscription
> --output_table [MY-Project]:test.dataflowrunner --temp_location
> gs://[MY-Project]/tmp --worker_harness_container_image
> gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api
> --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar
> '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar'
> --debug
>
>
It is hard for us to debug this without knowing what happened with the
Dataflow jobs. Another way to verify this could be running one of the
existing tests on your PR. You can try [1] by commenting "Run Python
Dataflow ValidatesRunner" on your PR.

[1]
https://github.com/apache/beam/blob/master/.test-infra/jenkins/job_PostCommit_Python_ValidatesRunner_Dataflow.groovy


> The job starts ok, but when I publish messages, nothing is being read from
> the queue. I'm assuming this might be to do with the changes made, and
> whilst it works with the DirectRunner this is not the case with the
> Dataflow runner; but I'm not receiving any errors in the debug stream on my
> console, only a series of info messages around the worker configuration.
>
> This may well be me doing stuff wrong, so apologies if I'm being thick
> here! Have I missed some steps in the build?
>
> Regards
>
> Matthew
>
> -----Original Message-----
> *From*: Ahmet Altay <altay@google.com>
> *Reply-To*: dev@beam.apache.org
> *To*: dev <dev@beam.apache.org>, Udi Meiri <ehudm@google.com>
> *Subject*: Re: [BEAM-7819] -python - parsing message_id from PubSub
> message to the PubSubMessage attributes property
> *Date*: Mon, 29 Jul 2019 09:50:14 -0700
>
> Hi Matthew,
>
> This looks like a reasonable approach. There is a difference in how the
> direct runner reads from pubsub compared to other runners. As you convert
> this to a PR, please pay attention to the difference and add tests for both cases.
>
> On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <
> Matthew.Darwin@carfinance247.co.uk> wrote:
>
> Hi All,
>
>
> This is my first attempt at a change for Beam, for
> https://issues.apache.org/jira/browse/BEAM-7819.  This parses the
> message_id when reading from the PubSub protobuf and adds it to the
> attributes property, as suggested by the existing documentation -
> https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes
>
>
> My code is currently checked into my fork:-
> https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id
>
>
> Prior to creating a pull request, I just wanted to check that the approach is
> sensible, and then to go through the pull request process to ensure it
> goes smoothly.  I do wonder whether, in the long term, despite the documentation
> indicating this would be part of the attributes property, it might be more
> sensible to fully mimic the protobuf for the PubSub message.
>
>
> I agree that making it similar to the protobuf sounds like a more sensible
> approach. Take a look at how it is handled in Java. There is also value in
> consistency across languages. Final decision could be made in the PR.
>
> Ahmet
>
> /cc +Udi Meiri <eh...@google.com>
>
>
>
> Kind regards
>
> Matthew
>
>

Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property

Posted by Matthew Darwin <Ma...@carfinance247.co.uk>.
Hi Ahmet/Udi,

There are a couple of additional tests that failed following my change: apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_strip_success and apache_beam.io.gcp.tests.pubsub_match_test.PubSubMatcherTest.test_message_matcher_attributes_success, as the added message_id in the attributes is not expected. I can fix these quite easily, but I'm not clear on what these tests are for, and what the pubsub_matcher is actually supposed to be testing for, so my arbitrary fix to add the expectation of message_id into the attributes property seems odd.

In terms of testing with the GCP Dataflow runner, I'm struggling a little. I've followed the guidance on the Python Tips page:-

I've built the tarball:-

python setup.py sdist

Then run the following:-

# Build portable worker
./gradlew :runners:google-cloud-dataflow-java:worker:build -x spotlessJava -x rat -x test
./gradlew :runners:google-cloud-dataflow-java:worker:shadowJar

# Build portable Python SDK harness and publish it to GCP
./gradlew -Pdocker-repository-root=gcr.io/dataflow-build/$USER/beam -p sdks/python/container docker
gcloud docker -- push gcr.io/dataflow-build/$USER/beam/python:latest

Then, using my simple test pipeline to insert into a BigQuery table, I run the following:-

python ~/github/Beam-Testing/PythonTestMessageId.py --runner DataflowRunner --project [MY-Project] --input_subscription projects/[MY-Project]/subscriptions/test-apache-beam.subscription --output_table [MY-Project]:test.dataflowrunner --temp_location gs://[MY-Project]/tmp --worker_harness_container_image gcr.io/dataflow-build/$USER/beam/python:latest --experiment beam_fn_api --sdk_location dist/apache-beam-2.15.0.dev0.tar.gz --dataflow_worker_jar '../../runners/google-cloud-dataflow-java/worker/build/libs/beam-runners-google-cloud-dataflow-java-fn-api-worker-2.15.0-SNAPSHOT.jar' --debug

The job starts ok, but when I publish messages, nothing is being read from the queue. I'm assuming this might be to do with the changes made, and whilst it works with the DirectRunner this is not the case with the Dataflow runner; but I'm not receiving any errors in the debug stream on my console, only a series of info messages around the worker configuration.

This may well be me doing stuff wrong, so apologies if I'm being thick here! Have I missed some steps in the build?

Regards

Matthew

-----Original Message-----
From: Ahmet Altay <altay@google.com>
Reply-To: dev@beam.apache.org
To: dev <dev@beam.apache.org>, Udi Meiri <ehudm@google.com>
Subject: Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property
Date: Mon, 29 Jul 2019 09:50:14 -0700

Hi Matthew,

This looks like a reasonable approach. There is a difference in how the direct runner reads from pubsub compared to other runners. As you convert this to a PR, please pay attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <Ma...@carfinance247.co.uk> wrote:

Hi All,


This is my first attempt at a change for Beam, for https://issues.apache.org/jira/browse/BEAM-7819.  This parses the message_id when reading from the PubSub protobuf and adds it to the attributes property, as suggested by the existing documentation - https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes


My code is currently checked into my fork:- https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id


Prior to creating a pull request, I just wanted to check that the approach is sensible, and then to go through the pull request process to ensure it goes smoothly.  I do wonder whether, in the long term, despite the documentation indicating this would be part of the attributes property, it might be more sensible to fully mimic the protobuf for the PubSub message.


I agree that making it similar to the protobuf sounds like a more sensible approach. Take a look at how it is handled in Java. There is also value in consistency across languages. Final decision could be made in the PR.

Ahmet

/cc +Udi Meiri<ma...@google.com>


Kind regards

Matthew


Re: [BEAM-7819] -python - parsing message_id from PubSub message to the PubSubMessage attributes property

Posted by Ahmet Altay <al...@google.com>.
Hi Matthew,

This looks like a reasonable approach. There is a difference in how the direct
runner reads from pubsub compared to other runners. As you convert this to a PR,
please pay attention to the difference and add tests for both cases.

On Mon, Jul 29, 2019 at 8:35 AM Matthew Darwin <
Matthew.Darwin@carfinance247.co.uk> wrote:

> Hi All,
>
>
> This is my first attempt at a change for Beam, for
> https://issues.apache.org/jira/browse/BEAM-7819.  This parses the
> message_id when reading from the PubSub protobuf and adds it to the
> attributes property, as suggested by the existing documentation -
> https://beam.apache.org/releases/pydoc/2.13.0/apache_beam.io.gcp.pubsub.html#apache_beam.io.gcp.pubsub.PubsubMessage.attributes
>
>
> My code is currently checked into my fork:-
> https://github.com/matt-darwin/beam/tree/BEAM-7819-parse-pubsub-message_id
>
>
> Prior to creating a pull request, I just wanted to check that the approach is
> sensible, and then to go through the pull request process to ensure it
> goes smoothly.  I do wonder whether, in the long term, despite the documentation
> indicating this would be part of the attributes property, it might be more
> sensible to fully mimic the protobuf for the PubSub message.
>

I agree that making it similar to the protobuf sounds like a more sensible
approach. Take a look at how it is handled in Java. There is also value in
consistency across languages. Final decision could be made in the PR.

Ahmet

/cc +Udi Meiri <eh...@google.com>

>
>
> Kind regards
>
> Matthew
>