Posted to dev@beam.apache.org by Kenneth Knowles <kl...@google.com.INVALID> on 2017/09/11 17:51:29 UTC

Re: [DISCUSS] Capability Matrix revamp

Closing the loop on this thread, I've summarized the suggestions into a
mega-ticket at https://issues.apache.org/jira/browse/BEAM-2888

Eventually, we'll need a redesign, but there is a lot that we can do
incrementally.

If you want to help, make a subtask for the piece you are handling, or I
can make one if there's a permissions issue.

Kenn

On Thu, Aug 31, 2017 at 2:02 AM, Jean-Baptiste Onofré <jb...@nanthrax.net>
wrote:

> Agree, it sounds like a good idea to me.
>
> Regards
> JB
>
>
> On 08/31/2017 10:35 AM, Etienne Chauchot wrote:
>
>> Hi,
>>
>> I think Nexmark (https://github.com/apache/beam/tree/master/sdks/java/nexmark)
>> could help in getting quantitative benchmark metrics for all the runners,
>> as Tyler suggested.
>>
>> Another thing: the current matrix might be wrong on custom window
>> merging. I think it should be *X* for Spark and Gearpump because of the
>> tickets below (though I haven't tested it lately, so the status may have
>> changed):
>>
>> https://issues.apache.org/jira/browse/BEAM-2759
>>
>> https://issues.apache.org/jira/browse/BEAM-2499
>>
>> But since Kenn suggested grouping all the windowing rows into merging and
>> non-merging sections, maybe this detail no longer matters.
>>
>> Best
>>
>> Etienne
>>
>>
>>
>> Le 23/08/2017 à 04:28, Kenneth Knowles a écrit :
>>
>>> Oh, I missed
>>>
>>> 11. Quantitative properties. This seems like an interesting and important
>>> project all on its own. Since Beam is so generic, we need pretty diverse
>>> measurements for a user to have a hope of extrapolating to their use
>>> case.
>>>
>>> Kenn
>>>
>>> On Tue, Aug 22, 2017 at 7:22 PM, Kenneth Knowles <kl...@google.com> wrote:
>>>
>>>> OK, so adding these good ideas to the list:
>>>>
>>>> 8. Plain-English summary that comes before the nitty-gritty.
>>>> 9. Comment on production readiness from maintainers. Maybe testimonials
>>>> are helpful if they can be obtained?
>>>> 10. Versioning of all of the above
>>>>
>>>> Any more thoughts? I'll summarize in a JIRA in a bit.
>>>>
>>>> Kenn
>>>>
>>>> On Tue, Aug 22, 2017 at 10:45 AM, Griselda Cuevas <gris@google.com.invalid>
>>>> wrote:
>>>>
>>>>> Hi, I'd also like to ask if versioning as proposed in BEAM-166
>>>>> <https://issues.apache.org/jira/browse/BEAM-166> is still relevant? If
>>>>> it is, would this be something we want to add to this proposal?
>>>>> G
>>>>>
>>>>> On 21 August 2017 at 08:31, Tyler Akidau <ta...@google.com.invalid>
>>>>> wrote:
>>>>>
>>>>>> Is there any way we could add quantitative runner metrics to this as
>>>>>> well? Like by having some benchmarks that process X amount of data,
>>>>>> and then detailing in the matrix latency, throughput, and (where
>>>>>> possible) cost, etc., numbers for each of the given runners? Semantic
>>>>>> support is one thing, but there are other differences between runners
>>>>>> that aren't captured by just checking feature boxes. I'd be curious
>>>>>> if anyone has other ideas in this vein as well. The benchmark idea
>>>>>> might not be the best way to go about it.
>>>>>>
>>>>>> -Tyler
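As a rough sketch of the kind of measurement Tyler describes, one could time a fixed workload and derive throughput from it. This is a hypothetical toy harness in plain Python (the function names and workload are illustrative, not Nexmark or any real Beam benchmark):

```python
import time

def measure_throughput(process, records):
    """Time a runner-like callable over a batch of records and report
    elapsed seconds plus records/sec -- the kind of quantitative
    numbers the matrix could carry per runner. Illustrative only."""
    start = time.perf_counter()
    for r in records:
        process(r)
    elapsed = time.perf_counter() - start
    rps = len(records) / elapsed if elapsed > 0 else float("inf")
    return elapsed, rps

# Toy workload standing in for a real per-runner benchmark.
elapsed, rps = measure_throughput(lambda r: r * 2, range(100_000))
print(f"{elapsed:.3f}s, {rps:,.0f} records/sec")
```

A real comparison would of course need identical pipelines, fixed datasets, and warm-up handling per runner, which is what Nexmark (mentioned later in the thread) provides.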
>>>>>>
>>>>>> On Sun, Aug 20, 2017 at 9:43 AM Jesse Anderson <jesse@bigdatainstitute.io>
>>>>>> wrote:
>>>>>>
>>>>>>> It'd be awesome to see these updated. I'd add two more:
>>>>>>>
>>>>>>>     1. A plain-English summary of the runner's support in Beam.
>>>>>>>     People who are new to Beam won't understand the in-depth
>>>>>>>     coverage and need a general idea of how it is supported.
>>>>>>>     2. The production readiness of the runner. Does the maintainer
>>>>>>>     think this runner is production ready?
>>>>>>>
>>>>>>> On Sun, Aug 20, 2017 at 8:03 AM Kenneth Knowles <kl...@google.com.invalid>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> I want to revamp
>>>>>>>> https://beam.apache.org/documentation/runners/capability-matrix/
>>>>>>>>
>>>>>>>> When Beam first started, we didn't work on feature branches for the
>>>>>>>> core runners, and they had a lot more gaps compared to what goes on
>>>>>>>> `master` today, so this tracked our progress in a way that was easy
>>>>>>>> for users to read. Now it is still our best/only comparison page
>>>>>>>> for users, but I think we could improve its usefulness.
>>>>>>>>
>>>>>>>> For the benefit of the thread, let me inline all the capabilities
>>>>>>>> fully here:
>>>>>>>>
>>>>>>>> ========================
>>>>>>>>
>>>>>>>> "What is being computed?"
>>>>>>>>   - ParDo
>>>>>>>>   - GroupByKey
>>>>>>>>   - Flatten
>>>>>>>>   - Combine
>>>>>>>>   - Composite Transforms
>>>>>>>>   - Side Inputs
>>>>>>>>   - Source API
>>>>>>>>   - Splittable DoFn
>>>>>>>>   - Metrics
>>>>>>>>   - Stateful Processing
>>>>>>>>
>>>>>>>> "Where in event time?"
>>>>>>>>   - Global windows
>>>>>>>>   - Fixed windows
>>>>>>>>   - Sliding windows
>>>>>>>>   - Session windows
>>>>>>>>   - Custom windows
>>>>>>>>   - Custom merging windows
>>>>>>>>   - Timestamp control
>>>>>>>>
>>>>>>>> "When in processing time?"
>>>>>>>>   - Configurable triggering
>>>>>>>>   - Event-time triggers
>>>>>>>>   - Processing-time triggers
>>>>>>>>   - Count triggers
>>>>>>>>   - [Meta]data driven triggers
>>>>>>>>   - Composite triggers
>>>>>>>>   - Allowed lateness
>>>>>>>>   - Timers
>>>>>>>>
>>>>>>>> "How do refinements relate?"
>>>>>>>>   - Discarding
>>>>>>>>   - Accumulating
>>>>>>>>   - Accumulating & Retracting
>>>>>>>>
>>>>>>>> ========================
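[To make the "Session windows" / "Custom merging windows" rows above concrete: session windowing assigns each event a window of length `gap` and merges windows that overlap. The sketch below is plain Python for intuition only; the function name and gap parameter are illustrative and are not the Beam SDK API.]

```python
def merge_sessions(timestamps, gap):
    """Assign each timestamp a window [t, t + gap), then merge any
    overlapping windows, as a session WindowFn would. Illustrative
    sketch of the semantics, not Beam's implementation."""
    windows = sorted((t, t + gap) for t in timestamps)
    merged = []
    for start, end in windows:
        if merged and start < merged[-1][1]:  # overlaps previous window
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Events at t=1,2,3 fall in one session (gap 5); t=20 starts a new one.
print(merge_sessions([1, 2, 3, 20], gap=5))  # [(1, 8), (20, 25)]
```

[A runner that supports "Custom merging windows" must run this kind of merge step for user-defined WindowFns, which is why it is a separate row from the non-merging window types.]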
>>>>>>>>
>>>>>>>> Here are some issues I'd like to improve:
>>>>>>>>
>>>>>>>>   - Rows that are impossible not to support (ParDo)
>>>>>>>>   - Rows where "support" doesn't really make sense (Composite
>>>>>>>>     transforms)
>>>>>>>>   - Rows that are actually the same model feature (non-merging
>>>>>>>>     WindowFns)
>>>>>>>>   - Rows that represent optimizations (Combine)
>>>>>>>>   - Rows in the wrong place (Timers)
>>>>>>>>   - Rows that have not been designed ([Meta]data-driven triggers)
>>>>>>>>   - Rows with names that appear nowhere else (Timestamp control)
>>>>>>>>   - No place to compare non-model differences between runners
>>>>>>>>
>>>>>>>> I'm still pondering how to improve this, but I thought I'd send the
>>>>>>>> notion out for discussion. Some imperfect ideas I've had:
>>>>>>>>
>>>>>>>> 1. Lump all the basic stuff (ParDo, GroupByKey, Read, Window) into
>>>>>>>> one row.
>>>>>>>> 2. Make sections as users see them, like "ParDo" / "Side Inputs",
>>>>>>>> not "What?" / "Side inputs".
>>>>>>>> 3. Add rows for non-model things, like portability framework
>>>>>>>> support, metrics backends, etc.
>>>>>>>> 4. Drop rows that are not informative (like Composite transforms)
>>>>>>>> or not yet designed.
>>>>>>>> 5. Reorganize the windowing section to be just support for merging
>>>>>>>> / non-merging windowing.
>>>>>>>> 6. Switch to a more distinct color scheme than the solid vs. faded
>>>>>>>> colors currently used.
>>>>>>>> 7. Find a web design that gets short descriptions into the
>>>>>>>> foreground, to make the matrix easier to grok.
>>>>>>>>
>>>>>>>> These are just a few thoughts, and not necessarily compatible with
>>>>>>>> each other. What do you think?
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>> --
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Jesse
>>>>>>>
>>>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbonofre@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>