Posted to issues@beam.apache.org by "Ryan Clough (Jira)" <ji...@apache.org> on 2021/11/17 18:18:00 UTC

[jira] [Comment Edited] (BEAM-13150) Integrate TFRecord/tf.train.Example with Beam Schemas and the DataFrame API

    [ https://issues.apache.org/jira/browse/BEAM-13150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17440618#comment-17440618 ] 

Ryan Clough edited comment on BEAM-13150 at 11/17/21, 6:17 PM:
---------------------------------------------------------------

Thanks [~bhulette] for rightfully moving this into its own issue - I'll continue the discussion from [BEAM-12955|https://issues.apache.org/jira/browse/BEAM-12955] to add some context/color.

On further reading of 12955, I agree that tf.train.Example is a distinct use case from protos in general. TF Examples are really just very flexible protos that depend on a [separate schema proto|https://github.com/tensorflow/metadata/blob/master/tensorflow_metadata/proto/v0/schema.proto] to define the structure for any given set of TF Examples. In the context of TFX, most components that process TF Examples also require the schema as input, probably for this reason.
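
To make this concrete, here's a minimal illustrative sketch in plain TensorFlow (the feature names are made up): the serialized Example bytes alone don't tell you how to decode them into columns - you need a feature spec, which is exactly the information the external schema proto carries.

{code:python}
import tensorflow as tf

# A tiny tf.train.Example with two features.
example = tf.train.Example(features=tf.train.Features(feature={
    "user_id": tf.train.Feature(int64_list=tf.train.Int64List(value=[42])),
    "query": tf.train.Feature(bytes_list=tf.train.BytesList(value=[b"beam dataframes"])),
}))
serialized = example.SerializeToString()

# Without a feature spec (in practice derived from the schema proto),
# the serialized bytes are just an opaque bag of named features.
feature_spec = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "query": tf.io.FixedLenFeature([], tf.string),
}
parsed = tf.io.parse_single_example(serialized, feature_spec)
{code}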

I think it would be a pretty big benefit if we could find a way to make this work at least somewhat seamlessly with the Beam DataFrame API. I suspect this could be done with either a dedicated reader, a flag on the existing reader, or some kind of standardized map that uses the existing TFRecord reader and then maps its output to a schema-aware PCollection that can be converted to a dataframe. Alternatively, maybe there's a way to make tfx-bsl's pyarrow RecordBatches qualify as a schema-aware PCollection and get the conversion utilities "for free".
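
To sketch what that "standardized map" might look like (untested, and assuming a hard-coded feature spec that a real solution would derive from the schema proto; the path and field names are made up for illustration):

{code:python}
import typing

import apache_beam as beam
import tensorflow as tf
from apache_beam.dataframe.convert import to_dataframe


# A NamedTuple with a registered RowCoder makes the PCollection schema-aware.
class ExampleRow(typing.NamedTuple):
    user_id: int
    query: str


beam.coders.registry.register_coder(ExampleRow, beam.coders.RowCoder)

_FEATURE_SPEC = {
    "user_id": tf.io.FixedLenFeature([], tf.int64),
    "query": tf.io.FixedLenFeature([], tf.string),
}


def to_row(serialized: bytes) -> ExampleRow:
    parsed = tf.io.parse_single_example(serialized, _FEATURE_SPEC)
    return ExampleRow(
        user_id=int(parsed["user_id"].numpy()),
        query=parsed["query"].numpy().decode("utf-8"),
    )


with beam.Pipeline() as p:
    rows = (
        p
        | beam.io.ReadFromTFRecord("/path/to/examples-*.tfrecord.gz")
        | beam.Map(to_row).with_output_types(ExampleRow)
    )
    df = to_dataframe(rows)  # schema-aware PCollection -> deferred dataframe
{code}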

My personal interest/use case: at the company I currently work for, we have largely standardized on TFX, which uses TFRecords as the data format and Beam (as provided by TFX) for many of the data processing operations, but the Beam aspect is almost entirely abstracted away from the end user. Thus we are currently in a state where users must choose one of three options:
 # Users must use an existing TFX component to process their data (transformation, model evaluation/metrics generation, etc.), where many of the API abstractions are either too limiting or too difficult to adapt for more complex use cases (and debugging Beam code you didn't write is very difficult when all you know is the abstraction)
 # Users must write single-threaded, i.e. script-like, processing code, and are thus limited on compute/memory
 # Users must learn and express their data processing in another framework (including Beam as an example option)

We're currently stuck between 1 and 2, as most of our users don't have the time to learn Beam on top of their existing priorities. My hope was to use the DataFrame API to bridge the gap for #3 - users would only need to define DataFrame API operations, and my team could hopefully abstract away most of the Beam aspects, which would allow for much more scalable and flexible data processing.
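
For illustration, with the deferred df from the sketch above, the user-facing code could be nothing but ordinary pandas-style operations (column names are the made-up ones from that sketch):

{code:python}
# Pure DataFrame API - no Beam-specific constructs visible to the user.
queries_per_user = df.groupby("user_id")["query"].count()
{code}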

I took a hack week to explore this approach and ran into issues trying to map my input dataset (TFRecords) to a schema-aware PCollection, which is how I ended up here :) I attempted an approach similar to [this SO post|https://stackoverflow.com/questions/68537184/how-to-unpack-dictionary-values-inside-a-beam-map-with-python-apache-beam] and ran into the same issue, but unfortunately the original poster didn't leave much context on how they solved it. In retrospect, I think the true solution will involve some more advanced input processing that takes the TF schema as input, as described above. If I have time to delve into this, I'll share anything I'm able to get working, which may help push us towards a working solution for this issue.
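
If I do get to it, the direction I'd explore first is deriving the row type from the schema proto itself rather than hard-coding a feature spec. A very rough, incomplete sketch (required scalar features only) of what I mean:

{code:python}
import typing

from tensorflow_metadata.proto.v0 import schema_pb2

# Map the schema proto's feature types onto Python types that Beam's
# RowCoder understands.
_TYPE_MAP = {
    schema_pb2.INT: int,
    schema_pb2.FLOAT: float,
    schema_pb2.BYTES: bytes,
}


def row_type_from_schema(schema: schema_pb2.Schema):
    """Build a NamedTuple row type from a TF metadata Schema proto."""
    fields = [(f.name, _TYPE_MAP[f.type]) for f in schema.feature]
    return typing.NamedTuple("ExampleRow", fields)
{code}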


> Integrate TFRecord/tf.train.Example with Beam Schemas and the DataFrame API
> ---------------------------------------------------------------------------
>
>                 Key: BEAM-13150
>                 URL: https://issues.apache.org/jira/browse/BEAM-13150
>             Project: Beam
>          Issue Type: Improvement
>          Components: dsl-dataframe, sdk-py-core
>            Reporter: Brian Hulette
>            Assignee: Brian Hulette
>            Priority: P2
>
> See discussion in BEAM-12995



--
This message was sent by Atlassian Jira
(v8.20.1#820001)