Posted to user@spark.apache.org by naliazheli <75...@qq.com> on 2016/05/25 01:43:10 UTC
job build cost more and more time
I am using Spark 1.6 and noticed that the time between jobs gets longer and longer; sometimes it can reach 20 minutes.
I searched for similar questions and found a close one:
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-app-gets-slower-as-it-gets-executed-more-times-td1089.html#a1146
It contains something useful:
One thing to worry about is long-running jobs or shells. Currently, state
buildup of a single job in Spark is a problem, as certain state such as
shuffle files and RDD metadata is not cleaned up until the job (or shell)
exits. We have hacky ways to reduce this, and are working on a long term
solution. However, separate, consecutive jobs should be independent in terms
of performance.
On Sat, Feb 1, 2014 at 8:27 PM, 尹绪森 <[hidden email]> wrote:
Is your Spark app an iterative one? If so, your app is creating a big DAG
in every iteration. You should checkpoint it periodically, say, one
checkpoint every 10 iterations.
I also wrote a test program; here is the code:

public static void newJob(int jobNum, SQLContext sqlContext) {
    for (int i = 0; i < jobNum; i++) {
        testJob(i, sqlContext);
    }
}

public static void testJob(int iteration, SQLContext sqlContext) {
    // Each iteration re-registers the result over the same temp table,
    // so the lineage behind "income" keeps growing.
    String test_sql = "SELECT a.* FROM income a";
    DataFrame test_df = sqlContext.sql(test_sql);
    test_df.registerTempTable("income");
    test_df.cache();
    test_df.count();
    test_df.show();
}
Calling newJob(100, sqlContext) reproduces my issue: job build takes more
and more time.
DataFrame has no lineage-truncating API like RDD's checkpoint.
Is there another way to resolve it?
--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/job-build-cost-more-and-more-time-tp27017.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org
Re: job build cost more and more time
Posted by nguyen duc tuan <ne...@gmail.com>.
Take a look here:
http://stackoverflow.com/questions/33424445/is-there-a-way-to-checkpoint-apache-spark-dataframes
So all you have to do to checkpoint a DataFrame is the following (a
checkpoint directory must be set beforehand with sc.setCheckpointDir):
df.rdd.checkpoint
df.rdd.count // or any action, to force the checkpoint to be written
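Applied to the test program above, a hedged sketch of that approach in Java against the Spark 1.6 API might look like the following. The checkpoint directory path, the 10-iteration interval, and the method and class names here are illustrative assumptions, not something stated in the original posts; it assumes the "income" table is already registered on the SQLContext.

```java
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.DataFrame;
import org.apache.spark.sql.SQLContext;

public class CheckpointLoop {
    public static void runJobs(int jobNum, JavaSparkContext jsc, SQLContext sqlContext) {
        // Checkpoint files are written here; any fault-tolerant path works.
        jsc.setCheckpointDir("/tmp/spark-checkpoints");

        for (int i = 0; i < jobNum; i++) {
            DataFrame df = sqlContext.sql("SELECT a.* FROM income a");

            // Every 10 iterations, truncate the growing lineage:
            // checkpoint the underlying RDD, force it to materialize,
            // then rebuild the DataFrame from the checkpointed data.
            if (i % 10 == 9) {
                df.rdd().checkpoint();
                df.rdd().count(); // action that materializes the checkpoint
                df = sqlContext.createDataFrame(df.rdd(), df.schema());
            }

            df.registerTempTable("income");
            df.cache();
            df.count();
            df.show();
        }
    }
}
```

Rebuilding the DataFrame via createDataFrame(df.rdd(), df.schema()) is what actually cuts the DAG for subsequent iterations; checkpointing alone does not change the plan that the next sqlContext.sql call sees.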
2016-05-25 8:43 GMT+07:00 naliazheli <75...@qq.com>: