Posted to dev@beam.apache.org by Udi Meiri <eh...@google.com> on 2018/03/20 00:13:59 UTC

Pubsub API feedback

Hi,
I wanted to get feedback about the upcoming Python Pubsub API. It is
currently experimental and only supports reading and writing UTF-8 strings.
My current proposal only concerns reading from Pubsub.

Classes:
- PubsubMessage: encapsulates Pubsub message payload and attributes.

PTransforms:
- ReadMessagesFromPubSub: Outputs elements of type ``PubsubMessage``.

- ReadPayloadsFromPubSub: Outputs elements of type ``str``.

- ReadStringsFromPubSub: Outputs elements of type ``unicode``, decoded from
UTF-8.
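
To make the difference between the three output types concrete, here is a
minimal sketch in plain Python that does not use Beam. The PubsubMessage
class and the as_* helpers below are hypothetical stand-ins for illustration,
not the code in the PR; the field names are assumptions taken from the
description above. It is written for Python 3, where Python 2's ``str``
corresponds to ``bytes`` and ``unicode`` corresponds to ``str``.

```python
class PubsubMessage(object):
    """Hypothetical stand-in: a raw payload plus string attributes."""

    def __init__(self, payload, attributes=None):
        self.payload = payload                    # bytes, as published
        self.attributes = dict(attributes or {})  # attribute name -> value


def as_message(msg):
    """Shape of a ReadMessagesFromPubSub element: the whole message."""
    return msg


def as_payload(msg):
    """Shape of a ReadPayloadsFromPubSub element: the undecoded payload."""
    return msg.payload


def as_string(msg):
    """Shape of a ReadStringsFromPubSub element: payload decoded from UTF-8."""
    return msg.payload.decode('utf-8')
```

Only the first shape keeps the attributes; the other two drop them and
differ only in whether the payload is decoded.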

Description of common PTransform arguments:
  topic: Cloud Pub/Sub topic in the form
    "projects/<project>/topics/<topic>". If provided, subscription must be
    None.
  subscription: Existing Cloud Pub/Sub subscription to use in the
    form "projects/<project>/subscriptions/<subscription>". If not specified,
    a temporary subscription will be created from the specified topic. If
    provided, topic must be None.
  id_label: The attribute on incoming Pub/Sub messages to use as a unique
    record identifier. When specified, the value of this attribute (which
    can be any string that uniquely identifies the record) will be used for
    deduplication of messages. If not provided, we cannot guarantee
    that no duplicate data will be delivered on the Pub/Sub stream. In this
    case, deduplication of the stream will be strictly best effort.
  timestamp_attribute: Message value to use as element timestamp. If None,
    uses message publishing time as the timestamp.
    Timestamp values should be in one of two formats:
    - A numerical value representing the number of milliseconds since the
      Unix epoch.
    - A string in RFC 3339 format. For example,
      2015-10-29T23:41:41.123Z. The sub-second component of the
      timestamp is optional, and digits beyond the first three (i.e., time
      units smaller than milliseconds) will be ignored.
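
The argument rules above can be sketched in plain Python as follows. This is
an illustration under stated assumptions, not the implementation in the PR:
check_source and parse_timestamp_millis are hypothetical names, the sketch
assumes that exactly one of topic and subscription must be given, and it
accepts only RFC 3339 timestamps with a "Z" suffix.

```python
import re
from datetime import datetime, timezone

# Simplification: only the "Z" (UTC) suffix is accepted, although RFC 3339
# also allows numeric offsets such as +02:00.
_RFC3339_UTC = re.compile(
    r'(\d{4})-(\d{2})-(\d{2})[Tt](\d{2}):(\d{2}):(\d{2})(?:\.(\d+))?[Zz]',
    re.ASCII)


def check_source(topic=None, subscription=None):
    """Returns the given source; raises unless exactly one is set."""
    if (topic is None) == (subscription is None):
        raise ValueError('Specify exactly one of topic or subscription.')
    return topic if topic is not None else subscription


def parse_timestamp_millis(value):
    """Converts a timestamp attribute value to milliseconds since the epoch."""
    if re.fullmatch(r'[0-9]+', value):
        return int(value)  # already milliseconds since the Unix epoch
    match = _RFC3339_UTC.fullmatch(value)
    if match is None:
        raise ValueError('Unrecognized timestamp: %r' % (value,))
    fields = [int(group) for group in match.groups()[:6]]
    whole_seconds = datetime(*fields, tzinfo=timezone.utc)
    # Keep at most three fractional digits; anything finer is dropped.
    millis = int((match.group(7) or '')[:3].ljust(3, '0'))
    return int(whole_seconds.timestamp()) * 1000 + millis
```

With this sketch, parse_timestamp_millis('2015-10-29T23:41:41.123Z') and
parse_timestamp_millis('2015-10-29T23:41:41.123456Z') both return
1446162101123, since digits past the millisecond are dropped.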

Code:
https://github.com/udim/beam/blob/b981dd618e9e1f667597eec2a91c7265a389c405/sdks/python/apache_beam/io/gcp/pubsub.py
PR: https://github.com/apache/beam/pull/4901

Re: Pubsub API feedback

Posted by Ahmet Altay <al...@google.com>.
Thank you Udi. Left some high level comments on the PR.

