Posted to dev@beam.apache.org by Kenneth Knowles <kl...@google.com> on 2017/12/04 21:13:33 UTC
Re: Schema-Aware PCollections

Nice. Commented a bit on the doc. +1 to working up the Python, Go, and
portability implications.

Kenn

On Thu, Nov 30, 2017 at 1:06 PM, Reuven Lax <re...@google.com> wrote:

> Thanks!
>
>
> On Thu, Nov 30, 2017 at 11:25 AM, Holden Karau <ho...@pigscanfly.ca>
> wrote:
>
>> Rocking, I'll start leaving some comments on this. I'm excited to see
>> work being done in this area as well :)
>>
>> On Thu, Nov 30, 2017 at 9:20 AM, Tyler Akidau <ta...@google.com> wrote:
>>
>>> On Wed, Nov 29, 2017 at 6:38 PM Reuven Lax <re...@google.com> wrote:
>>>
>>>> There has been a lot of conversation about schemas on PCollections
>>>> recently. There are a number of reasons for this. Schemas as first-class
>>>> objects in Beam provide a nice base for building BeamSQL. Spark has
>>>> provided schema support via DataFrames for over two years, and it has
>>>> proved to be very popular among Spark users; it turns out that FlumeJava -
>>>> the original inspiration for the Beam API - has had schema support for even
>>>> longer, though this feature was not included in the Beam (at that time
>>>> Dataflow) API. It turns out that most records have structure, and allowing
>>>> the system to understand record structure can both simplify usage of the
>>>> system and allow for new performance optimizations.
>>>>
>>>> After discussion with JB, Eugene, Kenn, Robert, and a number of others
>>>> on the list, I've started a proposal document here
>>>> <https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit?usp=sharing>
>>>> describing how schemas can be added to Beam in a manner that integrates
>>>> with the existing Beam API. The goal is not to blindly copy existing systems
>>>> that have schemas, but rather to ensure that we get the best fit for Beam.
>>>> Please comment on this proposal - as much feedback as possible is valuable.
>>>>
>>>> In addition, you may notice this document is incomplete. While it does
>>>> sketch out how schemas can fit into Beam semantically, many portions of
>>>> this design remain to be fleshed out. In particular, the API signatures are
>>>> only sketched at a high level; exactly what all these APIs will look
>>>> like has not yet been defined. I would welcome help from interested members
>>>> of the community to define these APIs, and to make sure we're covering all
>>>> relevant use cases.
>>>>
>>>
>>> Thanks for sharing this Reuven, I'm excited to see this being discussed.
>>> One global comment: all of the existing examples are in Java. It would be
>>> great if we could design this with Python in mind (and how it could
>>> interact cleanly with Pandas) at the same time. +Robert Bradshaw
>>> <ro...@google.com> , +Holden Karau <hk...@google.com> , and +Ahmet
>>> Altay <al...@google.com> , all of whom I've spoken with regarding this and
>>> other Python things recently, just to be sure they see it. But of course
>>> it'd be great if anyone working on Python could jump in.
>>>
>>> -Tyler
>>>
>>>
>>>
>>>>
>>>> Thanks all,
>>>>
>>>> Reuven
>>>>
>>>>
>>>>
>>
>>
>> --
>> Twitter: https://twitter.com/holdenkarau
>>
>
>