Posted to user@flink.apache.org by Ziyad Muhammed <mm...@gmail.com> on 2017/07/08 02:27:26 UTC

FlinkML ALS is taking too long to run

Dear all

I'm trying to run FlinkML ALS on the Yahoo-R2 data set [1] stored on HDFS.
The program runs without reporting any errors, but it never finishes. The
operators that keep running indefinitely are:

CoGroup (CoGroup at
org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:606))(11/240)

Join(Join at
org.apache.flink.ml.recommendation.ALS$.updateFactors(ALS.scala:576))(15/240)


I ran the job with the following parameters:

val als = ALS().setIterations(10).setNumFactors(10).setBlocks(100)

I did not set the HDFS temporary path. Can someone tell me which
parameters to set to run ALS on such a large data set? Why are these
operators running indefinitely?

[1] https://webscope.sandbox.yahoo.com/catalog.php?datatype=r

Best
Ziyad

Re: FlinkML ALS is taking too long to run

Posted by Sebastian Schelter <ss...@googlemail.com>.

I don't think you need to employ a distributed system for working with this
dataset. An SGD implementation on a single machine should easily handle the
job.
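
To make the single-machine suggestion concrete, here is a minimal sketch of
matrix factorization trained with plain stochastic gradient descent. It is
not code from this thread or from FlinkML: the `Rating` class, the
`SgdMatrixFactorization` object, and all default hyperparameters are
hypothetical and chosen only for illustration.

```scala
import scala.util.Random

// Hypothetical container for one observed rating (not a FlinkML type).
final case class Rating(user: Int, item: Int, value: Double)

object SgdMatrixFactorization {

  // Dot product of two equally sized factor vectors.
  def dot(a: Array[Double], b: Array[Double]): Double = {
    var sum = 0.0
    var i = 0
    while (i < a.length) { sum += a(i) * b(i); i += 1 }
    sum
  }

  // Learns user and item factors with plain SGD on the squared error.
  // The default hyperparameters are illustrative assumptions, not tuned values.
  def train(
      ratings: IndexedSeq[Rating],
      numUsers: Int,
      numItems: Int,
      numFactors: Int = 10,
      epochs: Int = 200,
      learningRate: Double = 0.02,
      lambda: Double = 0.01,
      seed: Long = 42L
  ): (Array[Array[Double]], Array[Array[Double]]) = {
    val rng = new Random(seed)
    // Small random initial factors.
    val userFactors = Array.fill(numUsers, numFactors)(rng.nextDouble() * 0.1)
    val itemFactors = Array.fill(numItems, numFactors)(rng.nextDouble() * 0.1)

    for (_ <- 0 until epochs) {
      // Visit the ratings in a fresh random order each epoch.
      for (r <- rng.shuffle(ratings)) {
        val u = userFactors(r.user)
        val v = itemFactors(r.item)
        val err = r.value - dot(u, v)
        var k = 0
        while (k < numFactors) {
          val uk = u(k)
          val vk = v(k)
          // Gradient step with L2 regularization on both factor vectors.
          u(k) = uk + learningRate * (err * vk - lambda * uk)
          v(k) = vk + learningRate * (err * uk - lambda * vk)
          k += 1
        }
      }
    }
    (userFactors, itemFactors)
  }

  // Root mean squared error of the factor model on the given ratings.
  def rmse(
      ratings: Seq[Rating],
      userFactors: Array[Array[Double]],
      itemFactors: Array[Array[Double]]
  ): Double = {
    val squaredErrors = ratings.map { r =>
      val e = r.value - dot(userFactors(r.user), itemFactors(r.item))
      e * e
    }
    math.sqrt(squaredErrors.sum / ratings.size)
  }
}
```

Calling `SgdMatrixFactorization.train(ratings, numUsers, numItems)` returns
the two factor matrices, and `rmse` reports the training error. For a real
data set the ratings would be read from a file and the user and item IDs
mapped to dense zero-based indices first; that step is left out here.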

Best,
Sebastian

2017-07-12 9:26 GMT+02:00 Andrea Spina <an...@radicalbit.io>:

> Dear Ziyad,
>
> Yes, I ran into the same very long runtimes with ALS at the time, and I
> saw improvements from increasing the number of blocks and decreasing the
> number of task slots per TaskManager (#TSs/TM), as you described.
>
> Cheers,
>
> Andrea

Re: FlinkML ALS is taking too long to run

Posted by Andrea Spina <an...@radicalbit.io>.

Dear Ziyad, 

Yes, I ran into the same very long runtimes with ALS at the time, and I
saw improvements from increasing the number of blocks and decreasing the
number of task slots per TaskManager (#TSs/TM), as you described.

Cheers,

Andrea






--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14192.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.

Re: FlinkML ALS is taking too long to run

Posted by Ziyad Muhammed <mm...@gmail.com>.

Dear Andrea

Thank you for your reply.
The job was stuck at the two operators I mentioned for more than 17 hours.
See the screenshot.

I was able to solve the problem by:
1. Reducing the number of task slots in the cluster from the number of
cores to half the number of cores.
2. Tuning the hyperparameter 'blocks'. I set it to double the job
parallelism.
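
The two adjustments above can be written down as a small sizing helper. This
is only a sketch: `AlsSizing` and `AlsSizingRules` are hypothetical names,
not part of Flink, and the code assumes that the job parallelism equals the
total number of task slots in the cluster, which the message does not state.

```scala
// Hypothetical result type (not a Flink class).
final case class AlsSizing(taskSlotsPerTaskManager: Int, parallelism: Int, blocks: Int)

object AlsSizingRules {

  // Applies the two rules of thumb from the message above:
  //   1. task slots per TaskManager = half the number of cores
  //   2. ALS blocks = double the job parallelism
  // Assumption: the job runs with one parallel task per available slot.
  def suggest(coresPerTaskManager: Int, numTaskManagers: Int): AlsSizing = {
    require(coresPerTaskManager >= 1, "coresPerTaskManager must be positive")
    require(numTaskManagers >= 1, "numTaskManagers must be positive")
    val slots = math.max(1, coresPerTaskManager / 2)
    val parallelism = slots * numTaskManagers
    AlsSizing(slots, parallelism, 2 * parallelism)
  }
}
```

The slot count would go into `taskmanager.numberOfTaskSlots` in the Flink
configuration, and the block count into `ALS().setBlocks(...)`. These are
starting points taken from one report, not guaranteed optima for other
clusters or data sets.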

Best
Ziyad

On Tue, Jul 11, 2017 at 5:53 PM, Andrea Spina <an...@radicalbit.io>
wrote:

> Dear Ziyad,
> could you kindly share some additional info about your environment
> (local/cluster, nodes, machines' configuration)?
> What exactly do you mean by "indefinitely"? How long has the job been
> hanging?
>
> I hope I can help you then.
>
> Cheers,
>
> Andrea

Re: FlinkML ALS is taking too long to run

Posted by Andrea Spina <an...@radicalbit.io>.

Dear Ziyad,
could you kindly share some additional info about your environment
(local/cluster, nodes, machines' configuration)?
What exactly do you mean by "indefinitely"? How long has the job been
hanging?

I hope I can help you then.

Cheers,

Andrea



--
View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/FlinkML-ALS-is-taking-too-long-to-run-tp14154p14186.html
Sent from the Apache Flink User Mailing List archive at Nabble.com.