Posted to dev@flink.apache.org by Ufuk Celebi <uc...@apache.org> on 2015/06/01 10:01:31 UTC

Re: [DISCUSS] Inconsistent naming of intermediate results

I would like to get this done with the upcoming release to have a stable
name for the documentation.

Thinking about the names with Stephan, he had a great suggestion to rename
them to "streams".

I like this idea very much. The supported result variants make more sense
when you think about them as streams... blocking vs. pipelined/back
pressure vs. no back pressure/persistent vs. ephemeral streams.

Any opinions on this?
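
The three pairs above can be read as independent properties of a stream. As a rough sketch only: the class StreamVariant below is hypothetical and is not Flink's API. It just models the three properties named in this message, and it makes no claim about which combinations are actually supported.

```java
// Hypothetical model for illustration, not a Flink class.
// Each field is one of the three axes named in the message above.
final class StreamVariant {
    final boolean pipelined;    // false means blocking
    final boolean backPressure; // false means no back pressure
    final boolean persistent;   // false means ephemeral

    StreamVariant(boolean pipelined, boolean backPressure, boolean persistent) {
        this.pipelined = pipelined;
        this.backPressure = backPressure;
        this.persistent = persistent;
    }

    // Spells out the variant using the vocabulary from the thread.
    String describe() {
        return (pipelined ? "pipelined" : "blocking")
                + ", " + (backPressure ? "back pressure" : "no back pressure")
                + ", " + (persistent ? "persistent" : "ephemeral");
    }

    public static void main(String[] args) {
        System.out.println(new StreamVariant(true, true, false).describe());
        System.out.println(new StreamVariant(false, false, true).describe());
    }
}
```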


On Wed, Apr 1, 2015 at 3:39 PM, Maximilian Michels <mx...@apache.org> wrote:

> +1 for the renaming proposed by Ufuk.
>
> @Stephan: At the moment, the IntermediateDataSet is tied to a JobVertex.
> So the renaming makes sense. In the future, it might be constructed
> differently. Only then, JobVertexResult wouldn't make sense anymore. I'm
> not sure if that will even happen.
>
> > 4) ResultPartition => Result
> > 5) ResultSubpartition => ResultPartition
>
> Not sure about these. Maybe we should change them to ExecutionResult and
> ExecutionResultPartition because that's more specific and would relate to
> the other class names.
>
> On Wed, Apr 1, 2015 at 10:39 AM, Ufuk Celebi <uc...@apache.org> wrote:
>
> > To summarize so far: all are in favor of a rename. I agree with both of
> > Henry's points regarding the docs.
> >
> > @Stephan: what would you suggest? I would trust your gut feeling on this
> > one. ;) JobResult, ExecutionJobResult, ExecutionResult, etc.?
> >
> > On Tue, Mar 31, 2015 at 8:16 PM, Henry Saputra <he...@gmail.com> wrote:
> >
> > > As one of the devs who has recently been tracing the runtime portion of
> > > the code, +1 for renaming to get in line with the concepts.
> > >
> > > One thing I would like to have is an immediate change to the
> > > documentation [1] with the renaming PR. Otherwise
> > >
> > > Then we need to file a follow-up ticket to update Kostas' awesome wiki
> > > page [2].
> > >
> > > - Henry
> > >
> > > [1] http://ci.apache.org/projects/flink/flink-docs-master/internal_job_scheduling.html
> > > [2] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
> > >
> > > On Tue, Mar 31, 2015 at 7:50 AM, Ufuk Celebi <uc...@apache.org> wrote:
> > > > On a high level we call intermediate data produced by programs
> > > > "intermediate results". For example, in a WordCount map-reduce program
> > > > the map function produces an intermediate result, which consists of
> > > > (word, 1) pairs, and the reduce function consumes this intermediate
> > > > result. Kostas has recently added documentation explaining the core
> > > > concepts [1].
> > > >
> > > > The naming of classes related to intermediate results is inconsistent
> > > > (and probably confusing).
> > > >
> > > > - In JobGraphs (internal low-level API to define programs) they are
> > > >   called IntermediateDataSet and identified by IntermediateDataSetIDs.
> > > >
> > > > - In ExecutionGraphs (JobManager structure used for state
> > > >   tracking/scheduling) they are called IntermediateResult at the
> > > >   ExecutionJobVertex (identified by IntermediateDataSetID) and
> > > >   IntermediateResultPartition at the ExecutionVertex (identified by
> > > >   IntermediateResultPartitionID).
> > > >
> > > > - At runtime (TaskManager) they are called ResultPartition and
> > > >   identified by ResultPartitionID (composition of ExecutionAttemptID
> > > >   and IntermediateResultPartitionID). These are further subpartitioned
> > > >   into ResultSubpartition instances.
> > > >
> > > > I propose to get the naming more in line with the existing naming
> > > > scheme and prefix it with the corresponding management structures:
> > > >
> > > > 1) IntermediateDataSet => JobVertexResult (identified by
> > > >    JobVertexResultID)
> > > > 2) IntermediateResult => ExecutionJobVertexResult (identified by
> > > >    JobVertexResultID)
> > > > 3) IntermediateResultPartition => ExecutionVertexResult (identified by
> > > >    ExecutionVertexResultID)
> > > > 4) ResultPartition => Result
> > > > 5) ResultSubpartition => ResultPartition
> > > >
> > > > These names are non-user facing, but still at the core of the system. I
> > > > think that consistent naming of these classes will make it easier for
> > > > new contributors to get an overview of how single components relate to
> > > > each other (the prefixes indicate this). In the docs, we can still
> > > > refer to the high-level concept as "intermediate results".
> > > >
> > > > What's your opinion on this? I think now is a good time to think about
> > > > this stuff, because the core classes have only been added recently to
> > > > the system. Feel free to propose alternatives. :-)
> > > >
> > > > – Ufuk
> > > >
> > > > [1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
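
For readers skimming the thread, the five renames proposed in the quoted message can be written down as a lookup table. This is only a sketch: RenameProposal and its methods are hypothetical helpers made up for illustration and are not part of Flink. Only the class names in the table come from the thread.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration, not part of Flink.
final class RenameProposal {

    // Current class name -> proposed class name, in the order of items 1) to 5).
    static Map<String, String> proposedRenames() {
        Map<String, String> renames = new LinkedHashMap<>();
        renames.put("IntermediateDataSet", "JobVertexResult");
        renames.put("IntermediateResult", "ExecutionJobVertexResult");
        renames.put("IntermediateResultPartition", "ExecutionVertexResult");
        renames.put("ResultPartition", "Result");
        renames.put("ResultSubpartition", "ResultPartition");
        return renames;
    }

    // Names that appear both as a current name and as a proposed name.
    static Set<String> reusedNames() {
        Map<String, String> renames = proposedRenames();
        Set<String> reused = new LinkedHashSet<>(renames.keySet());
        reused.retainAll(renames.values());
        return reused;
    }

    public static void main(String[] args) {
        proposedRenames().forEach((from, to) -> System.out.println(from + " => " + to));
        System.out.println("Reused names: " + reusedNames());
    }
}
```

ResultPartition shows up both as a current name (item 4) and as a proposed name (item 5), so the table describes a simultaneous rename and should not be applied transitively.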

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Stephan Ewen <se...@apache.org>.

I am in principle with Ufuk on that, but let's not rush this into the
release. It is not a public API after all.

On Thu, Jun 4, 2015 at 5:23 PM, Ufuk Celebi <uc...@apache.org> wrote:

>
> On 04 Jun 2015, at 17:02, Maximilian Michels <mx...@apache.org> wrote:
>
> > I think ResultPartition is a pretty accurate description of what it is: a
> > partition of the result of an operator. ResultStream on the other hand,
> > seems very generic to me. Just because we like to think of Flink nowadays
> > as a "streaming data flow" engine, we don't have to change the core
> > classes' names :)
>
> Of course, we don't have to. ;-) But still, I think it makes sense for
> documentation and blog post purposes. The code is new, it's not like
> changing an old component, on which a lot of stuff depends AND we want to
> change the name anyways (see your comments in this thread).

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

On 04 Jun 2015, at 17:02, Maximilian Michels <mx...@apache.org> wrote:

> I think ResultPartition is a pretty accurate description of what it is: a
> partition of the result of an operator. ResultStream on the other hand,
> seems very generic to me. Just because we like to think of Flink nowadays
> as a "streaming data flow" engine, we don't have to change the core
> classes' names :)

Of course, we don't have to. ;-) But still, I think it makes sense for documentation and blog post purposes. The code is new, it's not like changing an old component, on which a lot of stuff depends AND we want to change the name anyways (see your comments in this thread).

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Maximilian Michels <mx...@apache.org>.

I think ResultPartition is a pretty accurate description of what it is: a
partition of the result of an operator. ResultStream on the other hand,
seems very generic to me. Just because we like to think of Flink nowadays
as a "streaming data flow" engine, we don't have to change the core
classes' names :)

On Thu, Jun 4, 2015 at 1:57 PM, Ufuk Celebi <uc...@apache.org> wrote:

>
> On 04 Jun 2015, at 13:10, Maximilian Michels <mx...@apache.org> wrote:
>
> > Rename what to streams? Do you mean "ResultPartition" =>
> > "StreamPartition"?
>
> Exactly along those lines, but maybe "ResultStream".
>
> > I'm not sure if that makes it easier to understand what the classes do.
>
> It fits better into the terminology of a "streaming data flow" engine.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

On 04 Jun 2015, at 13:10, Maximilian Michels <mx...@apache.org> wrote:

> Rename what to streams? Do you mean "ResultPartition" => "StreamPartition"?

Exactly along those lines, but maybe "ResultStream".

> I'm not sure if that makes it easier to understand what the classes do.

It fits better into the terminology of a "streaming data flow" engine.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Maximilian Michels <mx...@apache.org>.

Rename what to streams? Do you mean "ResultPartition" => "StreamPartition"?
I'm not sure if that makes it easier to understand what the classes do.

On Mon, Jun 1, 2015 at 10:11 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> +1
> I like it. We are a streaming system underneath after all.
> On Jun 1, 2015 10:02 AM, "Ufuk Celebi" <uc...@apache.org> wrote:
>
> > I would like to get this done with the upcoming release to have a stable
> > name for the documentation.
> >
> > Thinking about the names with Stephan, he had a great suggestion to
> > rename them to "streams".
> >
> > I like this idea very much. The supported result variants make more sense
> > when you think about them as streams... blocking vs. pipelined/back
> > pressure vs. no back pressure/persistent vs. ephemeral streams.
> >
> > Any opinions on this?

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Aljoscha Krettek <al...@apache.org>.

+1
I like it. We are a streaming system underneath after all.
On Jun 1, 2015 10:02 AM, "Ufuk Celebi" <uc...@apache.org> wrote:

> I would like to get this done with the upcoming release to have a stable
> name for the documentation.
>
> Thinking about the names with Stephan, he had a great suggestion to rename
> them to "streams".
>
> I like this idea very much. The supported result variants make more sense
> when you think about them as streams... blocking vs. pipelined/back
> pressure vs. no back pressure/persistent vs. ephemeral streams.
>
> Any opinions on this?
>