Posted to dev@flink.apache.org by Ufuk Celebi <uc...@apache.org> on 2015/03/31 16:50:50 UTC

[DISCUSS] Inconsistent naming of intermediate results

At a high level, we call intermediate data produced by programs "intermediate results". For example, in a WordCount map-reduce program, the map function produces an intermediate result consisting of (word, 1) pairs, and the reduce function consumes this intermediate result. Kostas has recently added documentation explaining the core concepts [1].

The naming of classes related to intermediate results is inconsistent (and probably confusing).

- In JobGraphs (internal low-level API to define programs) they are called IntermediateDataSet and identified by IntermediateDataSetIDs.

- In ExecutionGraphs (JobManager structure used for state tracking/scheduling) they are called IntermediateResult at the ExecutionJobVertex (identified by IntermediateDataSetID) and IntermediateResultPartition at the ExecutionVertex (identified by IntermediateResultPartitionID).

- At runtime (TaskManager) they are called ResultPartition and identified by ResultPartitionID (composition of ExecutionAttemptID and IntermediateResultPartitionID). These are further subpartitioned into ResultSubpartition instances.
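
To make the nesting of identifiers concrete, here is a minimal sketch in Python rather than Flink's Java. The class names mirror the ones listed above, but the fields, the UUID-based values, and the assumption that every retry of a task gets a fresh ExecutionAttemptID are illustrative guesses, not the actual Flink implementation.

```python
from dataclasses import dataclass
from uuid import UUID, uuid4


@dataclass(frozen=True)
class IntermediateDataSetID:
    """Identifies a result in the JobGraph and in the ExecutionJobVertex."""
    value: UUID


@dataclass(frozen=True)
class IntermediateResultPartitionID:
    """Identifies one partition of a result at the ExecutionVertex."""
    value: UUID


@dataclass(frozen=True)
class ExecutionAttemptID:
    """Identifies one attempt to run a task (assumed: new per retry)."""
    value: UUID


@dataclass(frozen=True)
class ResultPartitionID:
    """Runtime identifier: composition of the two IDs above."""
    partition_id: IntermediateResultPartitionID
    attempt_id: ExecutionAttemptID


# The same logical partition, produced by two different attempts.
partition = IntermediateResultPartitionID(uuid4())
first_attempt = ResultPartitionID(partition, ExecutionAttemptID(uuid4()))
retry = ResultPartitionID(partition, ExecutionAttemptID(uuid4()))
```

Under these assumptions the two runtime IDs differ, while both still point back to the same IntermediateResultPartitionID on the JobManager side.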

I propose to get the naming more in line with the existing naming scheme and prefix it with the corresponding management structures:

1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
2) IntermediateResult => ExecutionJobVertexResult (identified by JobVertexResultID)
3) IntermediateResultPartition => ExecutionVertexResult (identified by ExecutionVertexResultID)
4) ResultPartition => Result
5) ResultSubpartition => ResultPartition
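
One practical detail of this list: the new name in 5) equals the old name in 4), and several old names are prefixes of others. A rough sketch of how the mapping could be applied to text in a single pass, so that no name is renamed twice, is shown below in Python. The helper name rename_identifiers and the regex approach are my own illustration, not existing tooling. ResultPartitionID is left untouched because the proposal does not name a replacement for it.

```python
import re

# Old name -> proposed name, taken from items 1) to 5) above.
RENAMES = {
    "IntermediateDataSet": "JobVertexResult",
    "IntermediateDataSetID": "JobVertexResultID",
    "IntermediateResult": "ExecutionJobVertexResult",
    "IntermediateResultPartition": "ExecutionVertexResult",
    "IntermediateResultPartitionID": "ExecutionVertexResultID",
    "ResultPartition": "Result",
    "ResultSubpartition": "ResultPartition",
}

# Match whole identifiers only, so "ResultPartition" does not fire inside
# "IntermediateResultPartition" or "ResultPartitionID".
_IDENTIFIER = re.compile(r"\b[A-Za-z_][A-Za-z0-9_]*\b")


def rename_identifiers(text: str) -> str:
    """Apply all renames in one pass; unknown identifiers stay unchanged."""
    return _IDENTIFIER.sub(lambda m: RENAMES.get(m.group(0), m.group(0)), text)
```

Applying the renames one after another instead (first 5, then 4) would turn ResultSubpartition into ResultPartition and then into Result, which is why the sketch looks up each identifier exactly once.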

These names are non-user facing, but still at the core of the system. I think that consistent naming of these classes will make it easier for new contributors to get an overview of how single components relate to each other (the prefixes indicate this). In the docs, we can still refer to the high-level concept as "intermediate results".

What's your opinion on this? I think now is a good time to think about this stuff, because the core classes have only been added recently to the system. Feel free to propose alternatives. :-)

– Ufuk

[1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Stephan Ewen <se...@apache.org>.

I like getting the consistency in there.

I was never thinking of the intermediate data sets to be strictly produced
by a vertex, so I am unsure whether we should use that exact naming scheme,
or one that disconnects the results from the term "VertexResult".

On Tue, Mar 31, 2015 at 5:27 PM, Kostas Tzoumas <kt...@apache.org> wrote:

> I like the fact that the naming scheme follows some logic.
>
> I also like that we have two easy to understand concepts:
> - Operator (be that in any of the above representations)
> - Result (of executing an operator)
>
> +1

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Kostas Tzoumas <kt...@apache.org>.

I like the fact that the naming scheme follows some logic.

I also like that we have two easy to understand concepts:
- Operator (be that in any of the above representations)
- Result (of executing an operator)

+1


Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Stephan Ewen <se...@apache.org>.

I am in principle with Ufuk on that, but let's not rush this into the
release. It is not a public API after all.

On Thu, Jun 4, 2015 at 5:23 PM, Ufuk Celebi <uc...@apache.org> wrote:

>
> On 04 Jun 2015, at 17:02, Maximilian Michels <mx...@apache.org> wrote:
>
> > I think ResultPartition is a pretty accurate description of what it is: a
> > partition of the result of an operator. ResultStream on the other hand,
> > seems very generic to me. Just because we like to think of Flink nowadays
> > as a "streaming data flow" engine, we don't have to change the core
> > classes' names :)
>
> Of course, we don't have to. ;-) But still, I think it makes sense for
> documentation and blog post purposes. The code is new, it's not like
> changing an old component, on which a lot of stuff depends AND we want to
> change the name anyways (see your comments in this thread).

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

On 04 Jun 2015, at 17:02, Maximilian Michels <mx...@apache.org> wrote:

> I think ResultPartition is a pretty accurate description of what it is: a
> partition of the result of an operator. ResultStream on the other hand,
> seems very generic to me. Just because we like to think of Flink nowadays
> as a "streaming data flow" engine, we don't have to change the core
> classes' names :)

Of course, we don't have to. ;-) But still, I think it makes sense for documentation and blog post purposes. The code is new, it's not like changing an old component, on which a lot of stuff depends AND we want to change the name anyways (see your comments in this thread).

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Maximilian Michels <mx...@apache.org>.

I think ResultPartition is a pretty accurate description of what it is: a
partition of the result of an operator. ResultStream on the other hand,
seems very generic to me. Just because we like to think of Flink nowadays
as a "streaming data flow" engine, we don't have to change the core
classes' names :)

On Thu, Jun 4, 2015 at 1:57 PM, Ufuk Celebi <uc...@apache.org> wrote:

>
> On 04 Jun 2015, at 13:10, Maximilian Michels <mx...@apache.org> wrote:
>
> > Rename what to streams? Do you mean "ResultPartition" => "StreamPartition"?
>
> Exactly along those lines, but maybe "ResultStream".
>
> > I'm not sure if that makes it easier to understand what the classes do.
>
> It fits better into the terminology of a "streaming data flow" engine.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

On 04 Jun 2015, at 13:10, Maximilian Michels <mx...@apache.org> wrote:

> Rename what to streams? Do you mean "ResultPartition" => "StreamPartition"?

Exactly along those lines, but maybe "ResultStream".

> I'm not sure if that makes it easier to understand what the classes do.

It fits better into the terminology of a "streaming data flow" engine.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Maximilian Michels <mx...@apache.org>.

Rename what to streams? Do you mean "ResultPartition" => "StreamPartition"?
I'm not sure if that makes it easier to understand what the classes do.

On Mon, Jun 1, 2015 at 10:11 AM, Aljoscha Krettek <al...@apache.org>
wrote:

> +1
> I like it. We are a streaming system underneath after all.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Aljoscha Krettek <al...@apache.org>.

+1
I like it. We are a streaming system underneath after all.
On Jun 1, 2015 10:02 AM, "Ufuk Celebi" <uc...@apache.org> wrote:

> I would like to get this done with the upcoming release to have a stable
> name for the documentation.
>
> Thinking about the names with Stephan, he had a great suggestion to rename
> them to "streams".
>
> I like this idea very much. The supported result variants make more sense
> when you think about them as streams... blocking vs. pipelined/back
> pressure vs. no back pressure/persistent vs. ephemeral streams.
>
> Any opinions on this?

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

I would like to get this done with the upcoming release to have a stable
name for the documentation.

Thinking about the names with Stephan, he had a great suggestion to rename
them to "streams".

I like this idea very much. The supported result variants make more sense
when you think about them as streams... blocking vs. pipelined/back
pressure vs. no back pressure/persistent vs. ephemeral streams.
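
As a rough illustration of these three axes, the sketch below models a stream as a combination of independent flags. It is written in Python, and the name StreamTraits and its fields are hypothetical. Which combinations the runtime actually supports is not something this sketch claims to know.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StreamTraits:
    """Hypothetical model of the three axes named above."""
    pipelined: bool      # False means blocking
    back_pressure: bool  # False means no back pressure
    persistent: bool     # False means ephemeral

    def describe(self) -> str:
        return ", ".join([
            "pipelined" if self.pipelined else "blocking",
            "back pressure" if self.back_pressure else "no back pressure",
            "persistent" if self.persistent else "ephemeral",
        ])


example = StreamTraits(pipelined=True, back_pressure=True, persistent=False)
```

Here example.describe() yields a short label for one such combination.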

Any opinions on this?


On Wed, Apr 1, 2015 at 3:39 PM, Maximilian Michels <mx...@apache.org> wrote:

> +1 for the renaming proposed by Ufuk.
>
> @Stephan: At the moment, the IntermediateDataSet is tied to a JobVertex.
> So the renaming makes sense. In the future, it might be constructed
> differently. Only then, JobVertexResult wouldn't make sense anymore. I'm
> not sure if that will even happen.
>
> > 4) ResultPartition => Result
> > 5) ResultSubpartition => ResultPartition
>
> Not sure about these. Maybe we should change them to ExecutionResult and
> ExecutionResultPartition because that's more specific and would relate to
> the other class names.

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Maximilian Michels <mx...@apache.org>.

+1 for the renaming proposed by Ufuk.

@Stephan: At the moment, the IntermediateDataSet is tied to a JobVertex.
So the renaming makes sense. In the future, it might be constructed
differently. Only then, JobVertexResult wouldn't make sense anymore. I'm
not sure if that will even happen.

> 4) ResultPartition => Result
> 5) ResultSubpartition => ResultPartition

Not sure about these. Maybe we should change them to ExecutionResult and
ExecutionResultPartition because that's more specific and would relate to
the other class names.

On Wed, Apr 1, 2015 at 10:39 AM, Ufuk Celebi <uc...@apache.org> wrote:

> To summarize so far: all are in favor of a rename. I agree with both of
> Henry's points regarding the docs.
>
> @Stephan: what would you suggest? I would trust your gut feeling on this
> one. ;) JobResult, ExecutionJobResult, ExecutionResult, etc.?

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Ufuk Celebi <uc...@apache.org>.

To summarize so far: all are in favor of a rename. I agree with both of
Henry's points regarding the docs.

@Stephan: what would you suggest? I would trust your gut feeling on this
one. ;) JobResult, ExecutionJobResult, ExecutionResult, etc.?

On Tue, Mar 31, 2015 at 8:16 PM, Henry Saputra <he...@gmail.com>
wrote:

> As one of the devs who has recently been tracing the runtime portion of
> the code, +1 for renaming to bring the names in line with the concepts.
>
> One thing I would like to have is an immediate change to the
> documentation [1] with the renaming PR. Otherwise
>
> Then we need to file a followup ticket to update Kostas' awesome wiki page [2].
>
> - Henry
>
> [1]
> http://ci.apache.org/projects/flink/flink-docs-master/internal_job_scheduling.html
> [2]
> https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
>
> On Tue, Mar 31, 2015 at 7:50 AM, Ufuk Celebi <uc...@apache.org> wrote:
> > On a high level we call intermediate data produced by programs
> "intermediate results". For example in a WordCount map-reduce program the
> map function produces an intermediate result, which consists of (word, 1)
> pairs and the reduce function consumes this intermediate result. Kostas has
> recently added documentation explaining the core concepts [1].
> >
> > The naming of classes related to intermediate results is inconsistent
> (and probably confusing).
> >
> > - In JobGraphs (internal low-level API to define programs) they are
> called IntermediateDataSet and identified by IntermediateDataSetIDs.
> >
> > - In ExecutionGraphs (JobManager structure used for state
> tracking/scheduling) they are called IntermediateResult at the
> ExecutionJobVertex (identified by IntermediateDataSetID) and
> IntermediateResultPartition at the ExecutionVertex (identified by
> IntermediateResultPartitionID).
> >
> > - At runtime (TaskManager) they are called ResultPartition and
> identified by ResultPartitionID (composition of ExecutionAttemptID and
> IntermediateResultPartitionID). These are further subpartitioned into
> ResultSubpartition instances.
> >
> > I propose to get the naming more in line with the existing naming scheme
> and prefix it with the corresponding management structures:
> >
> > 1) IntermediateDataSet => JobVertexResult (identified by
> JobVertexResultID)
> > 2) IntermediateResult => ExecutionJobVertexResult (identified by
> JobVertexResultID)
> > 3) IntermediateResultPartition => ExecutionVertexResult (identified by
> ExecutionVertexResultID)
> > 4) ResultPartition => Result
> > 5) ResultSubpartition => ResultPartition
> >
> > These names are non-user facing, but still at the core of the system. I
> think that consistent naming of these classes will make it easier for new
> contributors to get an overview of how single components relate to each
> other (the prefixes indicate this). In the docs, we can still refer to the
> high-level concept as "intermediate results".
> >
> > What's your opinion on this? I think now is a good time to think about
> this stuff, because the core classes have only been added recently to the
> system. Feel free to propose alternatives. :-)
> >
> > – Ufuk
> >
> > [1]
> https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
>

Re: [DISCUSS] Inconsistent naming of intermediate results

Posted by Henry Saputra <he...@gmail.com>.

As one of the devs who has recently been tracing the runtime portion of
the code, +1 for renaming to bring it in line with the concepts.

One thing I would like to have is an immediate change to the
documentation [1] together with the renaming PR.

Then we need to file a follow-up ticket to update Kostas' awesome wiki page [2].

- Henry

[1] http://ci.apache.org/projects/flink/flink-docs-master/internal_job_scheduling.html
[2] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

On Tue, Mar 31, 2015 at 7:50 AM, Ufuk Celebi <uc...@apache.org> wrote:
> On a high level we call intermediate data produced by programs "intermediate results". For example in a WordCount map-reduce program the map function produces an intermediate result, which consists of (word, 1) pairs and the reduce function consumes this intermediate result. Kostas has recently added documentation explaining the core concepts [1].
>
> The naming of classes related to intermediate results is inconsistent (and probably confusing).
>
> - In JobGraphs (internal low-level API to define programs) they are called IntermediateDataSet and identified by IntermediateDataSetIDs.
>
> - In ExecutionGraphs (JobManager structure used for state tracking/scheduling) they are called IntermediateResult at the ExecutionJobVertex (identified by IntermediateDataSetID) and IntermediateResultPartition at the ExecutionVertex (identified by IntermediateResultPartitionID).
>
> - At runtime (TaskManager) they are called ResultPartition and identified by ResultPartitionID (composition of ExecutionAttemptID and IntermediateResultPartitionID). These are further subpartitioned into ResultSubpartition instances.
>
> I propose to get the naming more in line with the existing naming scheme and prefix it with the corresponding management structures:
>
> 1) IntermediateDataSet => JobVertexResult (identified by JobVertexResultID)
> 2) IntermediateResult => ExecutionJobVertexResult (identified by JobVertexResultID)
> 3) IntermediateResultPartition => ExecutionVertexResult (identified by ExecutionVertexResultID)
> 4) ResultPartition => Result
> 5) ResultSubpartition => ResultPartition
>
> These names are non-user facing, but still at the core of the system. I think that consistent naming of these classes will make it easier for new contributors to get an overview of how single components relate to each other (the prefixes indicate this). In the docs, we can still refer to the high-level concept as "intermediate results".
>
> What's your opinion on this? I think now is a good time to think about this stuff, because the core classes have only been added recently to the system. Feel free to propose alternatives. :-)
>
> – Ufuk
>
> [1] https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks
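
For anyone scripting the rename, the five proposed mappings quoted above can be written down as a simple lookup table. This is only a rough sketch: the class ProposedRenames and its methods are hypothetical helpers made up for illustration and are not part of Flink. It also shows one thing to watch out for: ResultPartition is both an old name (item 4) and a new name (item 5), so each name must be mapped exactly once rather than chaining the renames.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only; not part of the Flink code base.
public class ProposedRenames {

    // Old class name -> proposed class name, in the order listed in the proposal.
    static Map<String, String> renames() {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("IntermediateDataSet", "JobVertexResult");            // JobGraph level
        m.put("IntermediateResult", "ExecutionJobVertexResult");    // ExecutionJobVertex level
        m.put("IntermediateResultPartition", "ExecutionVertexResult"); // ExecutionVertex level
        m.put("ResultPartition", "Result");                         // runtime (TaskManager)
        m.put("ResultSubpartition", "ResultPartition");             // runtime (TaskManager)
        return m;
    }

    // Maps a name exactly once. Names outside the proposal are returned unchanged.
    // Applying this twice would wrongly turn ResultSubpartition into Result.
    static String rename(String oldName) {
        return renames().getOrDefault(oldName, oldName);
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> e : renames().entrySet()) {
            System.out.println(e.getKey() + " => " + e.getValue());
        }
    }
}
```

Because of the overlap between items 4 and 5, a plain sequential search-and-replace over the code base would need to apply item 4 before item 5, or go through temporary names.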