Posted to issues@spark.apache.org by "Alger Remirata (JIRA)" <ji...@apache.org> on 2016/01/06 17:42:39 UTC

[jira] [Comment Edited] (SPARK-5955) Add checkpointInterval to ALS

    [ https://issues.apache.org/jira/browse/SPARK-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085794#comment-15085794 ] 

Alger Remirata edited comment on SPARK-5955 at 1/6/16 4:42 PM:
---------------------------------------------------------------

Hi Xiangrui Meng, 

First of all, I would like to thank you for developing Spark and releasing it as open source so that we can use it. I'm Alger Remirata, a researcher from the Philippines. I'm new to Spark and Scala, and I am working on a project involving matrix factorization in Spark. I have a problem running ALS in Spark: it throws a StackOverflowError due to a long lineage chain, according to comments I found online. One of the suggestions is to use setCheckpointInterval so that every 10-20 iterations the RDDs are checkpointed, which prevents the error. I just want to ask for details on how to do checkpointing with ALS. I am using the spark-kernel developed by IBM (https://github.com/ibm-et/spark-kernel) instead of spark-shell.

Here are some of my specific questions regarding checkpointing:

1. When setting the checkpoint directory through SparkContext.setCheckpointDir(), it needs to be a Hadoop-compatible directory. Can we use any available HDFS-compatible directory?
2. What does this comment in the ALS checkpointing code mean:
"If the checkpoint directory is not set in [[org.apache.spark.SparkContext]], this setting is ignored."
3. Is calling setCheckpointInterval the only code I need to add to make checkpointing work for ALS?
4. I am getting this error: Name: java.lang.IllegalArgumentException, Message: Wrong FS: expected file:///. How can I solve this? What is the proper way of using checkpointing?

Thanks a lot!
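For reference, the setup being asked about can be sketched roughly as follows. This is a minimal sketch assuming Spark 1.3.1+ (where SPARK-5955 is fixed); the HDFS URIs, file paths, and parameter values are placeholders for illustration, not taken from this thread:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

val conf = new SparkConf().setAppName("ALSCheckpointSketch")
val sc = new SparkContext(conf)

// Use a fully qualified URI for the checkpoint directory. Passing a bare
// local path on a cluster whose fs.defaultFS points at HDFS is one common
// cause of the "Wrong FS" IllegalArgumentException.
sc.setCheckpointDir("hdfs://namenode:8020/tmp/als-checkpoints")

// Hypothetical ratings file: one "user,item,rating" triple per line.
val ratings = sc.textFile("hdfs://namenode:8020/data/ratings.csv").map { line =>
  val Array(user, item, rating) = line.split(',')
  Rating(user.toInt, item.toInt, rating.toDouble)
}

// With the checkpoint directory set, ALS checkpoints its intermediate RDDs
// every checkpointInterval iterations, truncating the lineage chain. If the
// checkpoint directory is NOT set, this setting is silently ignored.
val model = new ALS()
  .setRank(10)
  .setIterations(50)
  .setCheckpointInterval(10)
  .run(ratings)
```

In other words, two calls work together: setCheckpointDir() on the SparkContext and setCheckpointInterval() on the ALS instance; omitting the former makes the latter a no-op.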



> Add checkpointInterval to ALS
> -----------------------------
>
>                 Key: SPARK-5955
>                 URL: https://issues.apache.org/jira/browse/SPARK-5955
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML, MLlib
>    Affects Versions: 1.3.0
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>             Fix For: 1.3.1, 1.4.0
>
>
> We should add checkpoint interval to ALS to prevent the following:
> 1. storing large shuffle files
> 2. stack overflow (SPARK-1106)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
