Posted to dev@spark.apache.org by nitin <ni...@gmail.com> on 2015/02/21 17:55:49 UTC

Spark SQL - Long running job

Hi All,

I intend to build a long-running Spark application which fetches data/tuples
from Parquet, does some time-consuming processing, and then caches the
processed table (InMemoryColumnarTableScan). My use case needs fast retrieval
times for SQL queries (benefiting from the Spark SQL optimizer) and data
compression (built into the in-memory caching). The problem is that if my
driver goes down, I have to fetch the data for all the tables again, recompute
it, and cache it, which is time consuming.
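
Roughly, the flow looks like this (a simplified sketch against the SchemaRDD
API; the path, table names, and the aggregation query are just stand-ins for
my actual processing):

    val sqlContext = new org.apache.spark.sql.SQLContext(sc)

    // Load the raw data from Parquet (placeholder path).
    val raw = sqlContext.parquetFile("hdfs:///data/raw")
    raw.registerTempTable("raw_events")

    // Time-consuming processing step (stands in for the real transformation).
    val processed = sqlContext.sql(
      "SELECT key, COUNT(*) AS cnt FROM raw_events GROUP BY key")
    processed.registerTempTable("processed_events")

    // Cache the processed table in the in-memory columnar store
    // (InMemoryColumnarTableScan).
    sqlContext.cacheTable("processed_events")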

Is it possible to persist the processed/cached RDDs on disk so that my system
comes back up faster when it is restarted after a failure or shutdown?

On a side note, the data processing contains a shuffle step which creates huge
temporary shuffle files on local disk in the temp folder, and as per the
current logic, shuffle files are not deleted for running executors. This fills
up my local disk quickly and it runs out of space, since this is a
long-running Spark job. (I am running Spark in yarn-client mode, by the way.)

Thanks
-Nitin 





Re: Spark SQL - Long running job

Posted by Cheng Lian <li...@gmail.com>.
I meant using saveAsParquetFile. As for partition number, you can
always control it with the spark.sql.shuffle.partitions property.
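
For example, something along these lines (an untested sketch; the path and the
partition count are placeholders):

    // Fix the number of shuffle partitions before running the expensive query,
    // so the processed SchemaRDD ends up with a known number of partitions.
    sqlContext.setConf("spark.sql.shuffle.partitions", "200")

    // Write the processed result out to HDFS as Parquet.
    processedSchemaRdd.saveAsParquetFile("hdfs:///checkpoints/processed_events")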

Cheng

On 2/23/15 1:38 PM, nitin wrote:

> I believe calling processedSchemaRdd.persist(DISK) and
> processedSchemaRdd.checkpoint() only persists the data; I will lose all the
> RDD metadata, so when I restart my driver that data is essentially useless
> to me (correct me if I am wrong).
>
> I thought of doing processedSchemaRdd.saveAsParquetFile (to HDFS), but I fear
> that if my HDFS block size is larger than the partition file size, I will get
> more partitions when reading it back than the original SchemaRDD had.

Re: Spark SQL - Long running job

Posted by nitin <ni...@gmail.com>.
I believe calling processedSchemaRdd.persist(DISK) and
processedSchemaRdd.checkpoint() only persists the data; I will lose all the
RDD metadata, so when I restart my driver that data is essentially useless
to me (correct me if I am wrong).

I thought of doing processedSchemaRdd.saveAsParquetFile (to HDFS), but I fear
that if my HDFS block size is larger than the partition file size, I will get
more partitions when reading it back than the original SchemaRDD had.
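
For illustration, this is the kind of round trip I have in mind (a rough
sketch; the path is a placeholder):

    // Write the processed table out...
    processedSchemaRdd.saveAsParquetFile("hdfs:///checkpoints/processed_events")

    // ...and reload it later. The partition count after reloading is driven by
    // the Parquet/HDFS splits, so it may not match the original SchemaRDD.
    val reloaded = sqlContext.parquetFile("hdfs:///checkpoints/processed_events")
    println("partitions after reload: " + reloaded.partitions.size)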





Re: Spark SQL - Long running job

Posted by Cheng Lian <li...@gmail.com>.
How about persisting the computed result table before caching it?
That way, after restarting your service you only need to reload and cache
the result table, without recomputing it. Somewhat like checkpointing.
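
In code, something like the following (just a sketch; the query, table names,
and paths are placeholders):

    // First run: compute the expensive result, persist it to HDFS, then cache it.
    val processed = sqlContext.sql(
      "SELECT key, COUNT(*) AS cnt FROM raw_events GROUP BY key") // stand-in query
    processed.saveAsParquetFile("hdfs:///checkpoints/processed_events")
    processed.registerTempTable("processed_events")
    sqlContext.cacheTable("processed_events")

    // After a driver restart: reload the persisted result and cache it again,
    // without recomputing it.
    val restored = sqlContext.parquetFile("hdfs:///checkpoints/processed_events")
    restored.registerTempTable("processed_events")
    sqlContext.cacheTable("processed_events")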

Cheng

On 2/22/15 12:55 AM, nitin wrote:
> Hi All,
>
> I intend to build a long-running Spark application which fetches data/tuples
> from Parquet, does some time-consuming processing, and then caches the
> processed table (InMemoryColumnarTableScan). My use case needs fast retrieval
> times for SQL queries (benefiting from the Spark SQL optimizer) and data
> compression (built into the in-memory caching). The problem is that if my
> driver goes down, I have to fetch the data for all the tables again,
> recompute it, and cache it, which is time consuming.
>
> Is it possible to persist the processed/cached RDDs on disk so that my system
> comes back up faster when it is restarted after a failure or shutdown?
>
> On a side note, the data processing contains a shuffle step which creates
> huge temporary shuffle files on local disk in the temp folder, and as per the
> current logic, shuffle files are not deleted for running executors. This
> fills up my local disk quickly and it runs out of space, since this is a
> long-running Spark job. (I am running Spark in yarn-client mode, by the way.)
>
> Thanks
> -Nitin

