Posted to dev@flink.apache.org by Vasiliki Kalavri <va...@gmail.com> on 2014/07/10 14:57:00 UTC

Fixpoint api

Hello all,

I've recently been working on a Flink API for iterative fixpoint programs
and I would like to consult with you regarding some issues :)

The idea is to provide an easy way for users to write iterative programs of
the form x = f(x, D), where x is a set of parameters to be refined, f is
the step function and D is the set of dependencies among parameters.
From the user's point of view, the API looks a lot like Spargel: the user
only needs to provide the initial state of the parameters, the dependencies
dataset, an implementation of the step function and the maximum number of
iterations.
In the background, I spawn a Bulk iteration and while it's running, I
approximate the cost of an equivalent Delta iteration plan. As the fixpoint
problem gets closer to convergence, this cost diminishes, until it
(probably) becomes smaller than the Bulk iteration cost. This is
implemented inside a convergence criterion and when it is met, the
execution switches to the Delta iteration plan until convergence or until
the maximum number of iterations is reached.
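The bulk phase of this scheme amounts to repeatedly applying x = f(x, D)
until nothing changes. A toy illustration in plain Java, using shortest-path
relaxation as the step function — this only sketches the fixpoint form and
is not the actual Flink code in the branch:

```java
import java.util.HashMap;
import java.util.Map;

public class FixpointSketch {
    // x is a vertex -> value map, D is a set of weighted edges,
    // and f relaxes distances; iterate until x stops changing
    // or maxIterations is reached.
    static Map<Integer, Double> fixpoint(Map<Integer, Double> x,
                                         int[][] edges, int maxIterations) {
        for (int i = 0; i < maxIterations; i++) {
            Map<Integer, Double> next = new HashMap<>(x);
            for (int[] e : edges) { // e = {source, target, weight}
                double candidate =
                    x.getOrDefault(e[0], Double.POSITIVE_INFINITY) + e[2];
                if (candidate < next.getOrDefault(e[1], Double.POSITIVE_INFINITY)) {
                    next.put(e[1], candidate);
                }
            }
            if (next.equals(x)) break; // fixpoint reached
            x = next;
        }
        return x;
    }

    public static void main(String[] args) {
        Map<Integer, Double> init = new HashMap<>();
        init.put(0, 0.0); // source vertex
        int[][] edges = {{0, 1, 2}, {1, 2, 3}, {0, 2, 7}};
        Map<Integer, Double> result = fixpoint(init, edges, 10);
        System.out.println(result.get(2)); // shortest path 0->1->2 = 5.0
    }
}
```

The delta variant would only re-apply f to parameters whose value changed
in the previous superstep, which is what makes it cheaper near convergence.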

You can find the current state of the code in my local branch, under
stratosphere-fixpoint:
https://github.com/vasia/incubator-flink/tree/cost_model

One of the problems I have is that, when the Bulk iteration's cost
convergence criterion is met, I cannot find a way to figure out how many
iterations have been executed up to that point.
If I use a regular aggregator, I can only get the result after both the
Bulk and Delta iterations have finished (they belong to the same job), and
if I use an iteration aggregator, I cannot access its value after the Bulk
iteration has finished. Any ideas on possible solutions?

The second issue is that the fixpoint API should also support dependency
datasets with or without weights (Tuple2 or Tuple3).
Similarly to Spargel, I currently provide two separate methods for this,
and consequently two FixpointIteration constructors internally, plus
duplicate methods to support both cases.
Any ideas on how I could handle this more elegantly?

Thanks!

Cheers,
V.

Re: Fixpoint api

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hey,

sure, executing this as separate jobs works fine!
For now, I just wanted to make sure I'm not missing something obvious. When
ready for experiments, I will revisit this and try to avoid the file system.

Is there any description or issue related to the incremental rollout? Let
me know if I can help in any way!

Thanks,
V.



On 11 July 2014 18:01, Kostas Tzoumas <ko...@tu-berlin.de> wrote:

> As a first prototype, you can (using env.execute):
>
> - execute the bulk iteration as one job
> - grab the number of elapsed iterations using the static variable that
> Stephan mentioned
> - execute the delta iteration as a second job if needed
>
> On Fri, Jul 11, 2014 at 5:06 PM, Stephan Ewen <se...@apache.org> wrote:
>
> > Hi!
> >
> > That is exactly the incremental rollout variant.
> >
> > You can do this with intermediate file writing, currently. Would that be
> a
> > way to start prototyping? For benchmarking, we can even hack a caching
> > output and input format that keeps the data in the task manager's memory.
> >
> > Stephan
> >
>

Re: Fixpoint api

Posted by Kostas Tzoumas <ko...@tu-berlin.de>.
As a first prototype, you can (using env.execute):

- execute the bulk iteration as one job
- grab the number of elapsed iterations using the static variable that
Stephan mentioned
- execute the delta iteration as a second job if needed

On Fri, Jul 11, 2014 at 5:06 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> That is exactly the incremental rollout variant.
>
> You can do this with intermediate file writing, currently. Would that be a
> way to start prototyping? For benchmarking, we can even hack a caching
> output and input format that keeps the data in the task manager's memory.
>
> Stephan
>

Re: Fixpoint api

Posted by Stephan Ewen <se...@apache.org>.
Hi!

That is exactly the incremental rollout variant.

You can do this with intermediate file writing, currently. Would that be a
way to start prototyping? For benchmarking, we can even hack a caching
output and input format that keeps the data in the task manager's memory.

Stephan

Re: Fixpoint api

Posted by Vasiliki Kalavri <va...@gmail.com>.
Hey,

thanks for the reply!

My delta iteration doesn't have a convergence criterion (other than the
default empty workset that is).
If I'm not mistaken, one can't define a custom convergence criterion in a
delta iteration.
I think I had opened an issue for this at some point, but it was low
priority and we didn't decide on it.

Even if I do define a custom convergence criterion, though, that would mean
I would need to run at least one delta iteration, even if the maximum
number of iterations has already been reached.

What I am looking for is something more like the following:

int iterationsLeft;

bulkResult = doBulkIterations(parameters, dependencies, stepFunction,
    maxIterations); // somehow sets iterationsLeft
if (iterationsLeft > 0) {
    doDeltaIterations(bulkResult, dependencies, stepFunction, iterationsLeft);
}
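The static-variable workaround Stephan suggested could fill in the "somehow
sets iterationsLeft" part when the two phases run as separate jobs. A
minimal plain-Java sketch of the idea — all names are illustrative, and the
bulk job is only simulated here:

```java
// A UDF running inside the bulk iteration records the elapsed superstep
// count in a static field; after the first job finishes, the driver reads
// it and decides whether a delta-iteration job is still needed.
public class IterationCounter {
    // written from open()/close() of a UDF inside the bulk iteration
    public static volatile int elapsedIterations = 0;

    // stand-in for the bulk-iteration job: pretends the cost-based
    // convergence criterion fired after `converged` supersteps
    static void runBulkJob(int converged) {
        for (int superstep = 1; superstep <= converged; superstep++) {
            elapsedIterations = superstep; // what close() would do
        }
    }

    public static void main(String[] args) {
        int maxIterations = 10;
        runBulkJob(6); // criterion fires after 6 supersteps
        int iterationsLeft = maxIterations - elapsedIterations;
        if (iterationsLeft > 0) {
            System.out.println("delta iterations to run: " + iterationsLeft);
        }
    }
}
```

This only works while both jobs share one JVM (e.g. local execution); in a
distributed setting the count would have to travel through the file system
or the job result instead.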

Cheers,
V.


On 11 July 2014 14:25, Stephan Ewen <se...@apache.org> wrote:

> Hey Vasia!
>
> Nice work, it looks really cool!
>
>
> Concerning the problem of figuring out the number of iterations:
>
>  - In the long run we can save this with the incremental rollout that we
> are working on
>
>  - As a temporary workaround, you could store the number of elapsed
> iterations in a static variable somewhere. You set it from the "open()" or
> "close()" function of one of the UDFs. The convergence criterion for the
> delta iterations can pick it up from there (access the static variable).
>
>
> Concerning the edges with / without weights: Can you use a custom data type
> and have the weight "null" when they are not set/used?
>
> Greetings,
> Stephan
>

Re: Fixpoint api

Posted by Stephan Ewen <se...@apache.org>.
Hey Vasia!

Nice work, it looks really cool!


Concerning the problem of figuring out the number of iterations:

 - In the long run we can save this with the incremental rollout that we
are working on

 - As a temporary workaround, you could store the number of elapsed
iterations in a static variable somewhere. You set it from the "open()" or
"close()" function of one of the UDFs. The convergence criterion for the
delta iterations can pick it up from there (access the static variable).


Concerning the edges with / without weights: Can you use a custom data type
and have the weight "null" when they are not set/used?

Greetings,
Stephan
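Stephan's second suggestion — one dependency type with a nullable weight
instead of separate Tuple2/Tuple3 code paths — could look roughly like the
sketch below. The names are illustrative, not the branch's actual types,
and a real Flink type would additionally need to satisfy the system's
serialization rules:

```java
// A single edge/dependency type whose weight is simply null when the
// dependency dataset carries no weights, so one set of methods and one
// constructor can serve both the weighted and the unweighted case.
public class Dependency {
    public long source;
    public long target;
    public Double weight; // null for unweighted dependency datasets

    public Dependency() {} // default constructor, as serializers often require

    public Dependency(long source, long target, Double weight) {
        this.source = source;
        this.target = target;
        this.weight = weight;
    }

    public boolean hasWeight() { return weight != null; }

    public double weightOrDefault(double dflt) {
        return weight != null ? weight : dflt;
    }

    public static void main(String[] args) {
        Dependency unweighted = new Dependency(1L, 2L, null);
        Dependency weighted = new Dependency(1L, 2L, 0.5);
        System.out.println(unweighted.hasWeight());        // false
        System.out.println(weighted.weightOrDefault(1.0)); // 0.5
    }
}
```

Step functions that ignore weights can then simply never call
weightOrDefault(), so the two API variants collapse into one.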