Posted to user@spark.apache.org by Tim Schweichler <Ti...@healthination.com> on 2014/12/15 16:56:20 UTC

integrating long-running Spark jobs with Thriftserver

Hi everybody,

I apologize if the answer to my question is obvious but I haven't been able to find a straightforward solution anywhere on the internet.

I have a number of Spark jobs written using the Python API that do things like read data from Amazon S3 into a main table in the Hive metastore, perform intensive calculations on that data to build derived/aggregated tables, and so on. I also have Tableau set up to read those tables via the Spark Thriftserver.

My question is how best to integrate those two sides of Spark. I want the Thriftserver running constantly so that Tableau can update its extracts on a schedule and users can query those tables manually as needed, but I also need to run the Python jobs on a schedule. What's the best way to do that? The options I'm considering are as follows:


  1.  Simply call the Python jobs via spark-submit, scheduled by cron (a rough sketch of what I mean follows this list). My concern here is concurrency issues if Tableau or a user tries to read from a table at the same time that a job is rebuilding/updating that table. To my understanding the Thriftserver is designed to handle concurrency, but Spark in general is not when two different Spark contexts are accessing the same data (as would be the case with this approach). Am I correct in that thinking, or is there actually no problem with this method?
  2.  Call the Python jobs through the Spark Thriftserver so that the same Spark context is used. My question here is how to do that. I know one can call a Python script as part of a HiveQL query using TRANSFORM, but that seems designed more for performing quick calculations on existing data as part of a query than for building tables in the first place or for calling long-running jobs that don't return anything (again, am I correct in this thinking, or would this actually be a viable solution?). Is there a different way to call long-running Spark jobs via the Thriftserver?
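
To make option 1 concrete, here is roughly the kind of job I have in mind (all paths and table names below are placeholders). Note that each spark-submit run starts its own SparkContext, completely separate from the one inside the Thriftserver, which is where my concurrency worry comes from:

    # rebuild_main_table.py, submitted nightly from cron, e.g.:
    #   0 2 * * * /opt/spark/bin/spark-submit /opt/jobs/rebuild_main_table.py
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    # A brand-new SparkContext, independent of the Thriftserver's context.
    sc = SparkContext(appName="rebuild_main_table")
    hive = HiveContext(sc)

    # Stage the raw S3 data, then rewrite the main table in place.
    raw = hive.jsonFile("s3n://my-bucket/raw/")
    raw.registerTempTable("staging")
    hive.sql("INSERT OVERWRITE TABLE main_table SELECT * FROM staging")

    sc.stop()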

Are either of these good approaches or is there a better way that I'm missing?

Thanks!

Re: integrating long-running Spark jobs with Thriftserver

Posted by Cheng Lian <li...@gmail.com>.
Hi Schweichler,

This is an interesting and practical question. I'm not familiar with how 
Tableau works, but would like to share some thoughts.

In general, big data analytics frameworks like MapReduce and Spark tend to 
perform functional transformations over immutable data. In your case, 
however, the input table is mutated (rewritten) periodically and can 
potentially be accessed by multiple readers and writers at the same time, 
which introduces race conditions. A natural solution is to make the written 
data immutable by using a Hive table partitioned by time (e.g. by date or 
hour). The updating application always adds data by appending a new 
partition to the table, and Tableau always reads from the most recent 
available partition. That way, data already written is never modified and 
is therefore safe to read at any time. Outdated partitions can simply be 
dropped by a separate scheduled garbage-collection job.
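
For illustration, a minimal PySpark sketch of this pattern might look like 
the following, assuming a table partitioned by a date string ds (the table, 
bucket, and column names are made up):

    from datetime import datetime, timedelta
    from pyspark import SparkContext
    from pyspark.sql import HiveContext

    sc = SparkContext(appName="append-daily-partition")
    hive = HiveContext(sc)

    # One-time setup: a Hive table partitioned by a date string.
    hive.sql("""
        CREATE TABLE IF NOT EXISTS events (user_id STRING, value DOUBLE)
        PARTITIONED BY (ds STRING)
    """)

    ds = datetime.utcnow().strftime("%Y-%m-%d")

    # Stage today's raw data and write it into a brand-new partition;
    # partitions written on previous days are never touched.
    raw = hive.jsonFile("s3n://my-bucket/events/%s/" % ds)
    raw.registerTempTable("staging_events")
    hive.sql(("INSERT OVERWRITE TABLE events PARTITION (ds = '%s') "
              "SELECT user_id, value FROM staging_events") % ds)

    # Garbage collection: drop a partition that has aged out.
    old = (datetime.utcnow() - timedelta(days=7)).strftime("%Y-%m-%d")
    hive.sql("ALTER TABLE events DROP IF EXISTS PARTITION (ds = '%s')" % old)

    sc.stop()

Tableau (or any client of the Thriftserver) then reads only from the most 
recent ds value, and dropped partitions simply disappear from later queries.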

Cheng

On 12/17/14 1:35 AM, Tim Schweichler wrote:
> To ask a related question, if I use Zookeeper for table locking, will 
> this affect _all_ attempts to access the Hive tables (including those 
> from my Spark applications) or only those made through the 
> Thriftserver? In other words, does Zookeeper provide concurrency for 
> the Hive metastore in general or only for Hiveserver2/Spark's 
> Thriftserver?
>
> Thanks!


Re: integrating long-running Spark jobs with Thriftserver

Posted by Tim Schweichler <Ti...@healthination.com>.
To ask a related question, if I use Zookeeper for table locking, will this affect all attempts to access the Hive tables (including those from my Spark applications) or only those made through the Thriftserver? In other words, does Zookeeper provide concurrency for the Hive metastore in general or only for Hiveserver2/Spark's Thriftserver?

Thanks!
