Posted to issues@beam.apache.org by "Ryan Clough (Jira)" <ji...@apache.org> on 2021/10/28 15:34:00 UTC

[jira] [Issue Comment Deleted] (BEAM-12955) Add support for inferring Beam Schemas from Python protobuf types

     [ https://issues.apache.org/jira/browse/BEAM-12955?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ryan Clough updated BEAM-12955:
-------------------------------
    Comment: was deleted

(was: Just wanted to +1 that this would be a really useful feature. I'm reading in a bunch of TF Records (proto) that I want to turn into a dataframe, but I can't get `to_dataframe` to work without creating a schema'd pcollection. Every attempt I've made at generating a schema'd pcoll from a pcoll of proto objects has failed, and I'm a bit at a loss.)

> Add support for inferring Beam Schemas from Python protobuf types
> -----------------------------------------------------------------
>
>                 Key: BEAM-12955
>                 URL: https://issues.apache.org/jira/browse/BEAM-12955
>             Project: Beam
>          Issue Type: Improvement
>          Components: sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Svetak Vihaan Sundhar
>            Priority: P2
>              Labels: stale-assigned
>
> Just as we can infer a Beam Schema from a NamedTuple type ([code|https://github.com/apache/beam/blob/master/sdks/python/apache_beam/typehints/schemas.py]), we should have support for inferring a schema from a [protobuf-generated Python type|https://developers.google.com/protocol-buffers/docs/pythontutorial].
> This should integrate well with the rest of the schema infrastructure. For example, it should be possible to use schema-aware transforms like [SqlTransform|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.sql.html#apache_beam.transforms.sql.SqlTransform], [Select|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.transforms.core.html#apache_beam.transforms.core.Select], or [beam.dataframe.convert.to_dataframe|https://beam.apache.org/releases/pydoc/2.32.0/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe] on a PCollection that is annotated with a protobuf type. For example (using the addressbook_pb2 example from the [tutorial|https://developers.google.com/protocol-buffers/docs/pythontutorial#reading-a-message]):
> {code:python}
> import addressbook_pb2
> import apache_beam as beam
> from apache_beam.dataframe.convert import to_dataframe
> from apache_beam.transforms.sql import SqlTransform
> # input_pc and create_person are placeholders: an existing PCollection and a function returning a Person
> pc = (input_pc | beam.Map(create_person).with_output_types(addressbook_pb2.Person))
> df = to_dataframe(pc) # deferred dataframe with fields id, name, email, ...
> # OR
> pc | SqlTransform("SELECT name FROM PCOLLECTION WHERE email = 'foo@bar.com'")
> {code}
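
Until something like the above exists, one possible interim approach is to copy the proto fields into a NamedTuple, since the description notes that Beam can already infer a schema from that kind of type. The following is only a minimal sketch of that idea, not anything from the Beam codebase: PersonRow, to_person_row, and field_types are hypothetical names, and a SimpleNamespace stands in for a parsed addressbook_pb2.Person so that the snippet runs without protobuf or Beam installed.

```python
from types import SimpleNamespace
from typing import NamedTuple, get_type_hints


class PersonRow(NamedTuple):
    """Hypothetical row type mirroring a few scalar fields of Person."""
    name: str
    id: int
    email: str


def to_person_row(msg) -> PersonRow:
    # Copy each field declared on PersonRow off the message by attribute name.
    return PersonRow(**{field: getattr(msg, field) for field in PersonRow._fields})


def field_types(row_type):
    # Field names and Python types, in the order declared on the NamedTuple.
    return list(get_type_hints(row_type).items())


# Stand-in for a parsed addressbook_pb2.Person (assumption: real messages
# expose the same attribute names; extra attributes such as phones are ignored).
fake_person = SimpleNamespace(name="Ada", id=1, email="ada@example.com", phones=[])
row = to_person_row(fake_person)
```

In a pipeline, a function like this would presumably be applied with beam.Map(to_person_row).with_output_types(PersonRow) before calling to_dataframe; that wiring is not exercised here, and repeated or nested fields would need separate handling.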



--
This message was sent by Atlassian Jira
(v8.3.4#803005)