Posted to issues@beam.apache.org by "Brian Hulette (Jira)" <ji...@apache.org> on 2022/03/30 19:46:00 UTC
[jira] [Commented] (BEAM-14211) Add "interactive" DataFrame operations that eagerly trigger execution
[ https://issues.apache.org/jira/browse/BEAM-14211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514914#comment-17514914 ]
Brian Hulette commented on BEAM-14211:
--------------------------------------
CC: [~robertwb] [~yeandy]
> Add "interactive" DataFrame operations that eagerly trigger execution
> ---------------------------------------------------------------------
>
> Key: BEAM-14211
> URL: https://issues.apache.org/jira/browse/BEAM-14211
> Project: Beam
> Issue Type: Improvement
> Components: dsl-dataframe
> Reporter: Brian Hulette
> Priority: P2
>
> The DataFrame API is completely deferred by design, which means users can quickly build up a pipeline of operations and explicitly execute it when they want to. However, the pandas library is designed for eager execution on in-memory datasets, so many operations that users are accustomed to using in pandas are difficult or impossible to implement in a deferred context.
> We should consider adding a set of "interactive" tools that are eagerly executed through tight integration with Interactive Beam (i.e. they call ib.collect() internally). All non-deferred-result, non-deferred-columns, and plotting operations (see [coverage status|https://docs.google.com/spreadsheets/d/1hHAaJ0n0k2tw465ORs5tfdy4Lg0DnGWIQ53cLjAhel0/edit]) could be included in this set.
> We need to make sure that these tools are easily distinguishable from standard, deferred operations. It's important that users are not surprised when these operations trigger execution. I won't prescribe a detailed design here yet, but some approaches to consider:
> - All such operations are defined in a particular namespace ("interactive", "eager", "collect"?), e.g. users would access them as {{df.interactive.plot()}}, {{df.interactive.to_list()}}, {{df.interactive.pivot()}}.
> - When used in a notebook context, users could see some interaction (an "are you sure?" dialog, a page to enter parameters like project id, ...) that explains why execution was triggered and gives them an opportunity to abort.
> Ideally this feature would not be tightly coupled to notebooks. Users might want to use these tools in an IPython interpreter or in a Python script. Even plots could make sense in that context: the plot operation should return an object that the user can use to write the plot to a PNG file.
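
To make the two approaches above concrete, here is a rough, self-contained sketch of how a dedicated namespace plus an abort hook might fit together. It does not use Beam at all: DeferredSeries, InteractiveAccessor, ExecutionAborted and the confirm callback are hypothetical names invented for illustration, and _execute() is only a stand-in for whatever would actually run the pipeline (the description suggests ib.collect()).

```python
from typing import Any, Callable, List, Tuple


class ExecutionAborted(RuntimeError):
    """Raised when the user declines to trigger execution (hypothetical)."""


class DeferredSeries:
    """Toy stand-in for a deferred column: records operations, runs nothing."""

    def __init__(
        self,
        source: Callable[[], List[Any]],
        ops: Tuple[Callable[[Any], Any], ...] = (),
        confirm: Callable[[str], bool] = lambda reason: True,
    ):
        self._source = source
        self._ops = ops
        # Hook that a notebook could back with an "are you sure?" dialog.
        self._confirm = confirm

    def map(self, fn: Callable[[Any], Any]) -> "DeferredSeries":
        # Deferred operation: only extends the recorded plan.
        return DeferredSeries(self._source, self._ops + (fn,), self._confirm)

    def _execute(self) -> List[Any]:
        # Stand-in for running the pipeline and collecting the result.
        values = self._source()
        for fn in self._ops:
            values = [fn(v) for v in values]
        return values

    @property
    def interactive(self) -> "InteractiveAccessor":
        # Every eager operation lives behind this one attribute, so a call
        # that triggers execution is visible at the call site.
        return InteractiveAccessor(self)


class InteractiveAccessor:
    """Namespace holding the operations that eagerly trigger execution."""

    def __init__(self, deferred: DeferredSeries):
        self._deferred = deferred

    def to_list(self) -> List[Any]:
        reason = "to_list() needs the materialized values"
        if not self._deferred._confirm(reason):
            raise ExecutionAborted(reason)
        return self._deferred._execute()


# Building the plan does no work; only .interactive.to_list() executes it.
series = DeferredSeries(lambda: [1, 2, 3]).map(lambda x: x * 10)
result = series.interactive.to_list()
```

Because the confirm hook is a plain callable rather than a notebook widget, the same accessor would work unchanged in an IPython interpreter or a script, which is the decoupling the last paragraph of the description asks for.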
--
This message was sent by Atlassian Jira
(v8.20.1#820001)