Posted to user@spark.apache.org by Gang Li <lg...@gmail.com> on 2020/09/24 13:52:20 UTC

Let multiple jobs share one rdd?

Hi all,

There are three jobs that all share the same first RDD. Can that first RDD be
computed only once, with the subsequent operations then running in parallel?

<http://apache-spark-user-list.1001560.n3.nabble.com/file/t11001/1600954813075-image.png> 

My code is as follows:

sqls = ["""
            INSERT OVERWRITE TABLE `spark_input_test3` PARTITION
(dt='20200917')
            SELECT id, plan_id, status, retry_times, start_time,
schedule_datekey
            FROM temp_table where status=3""",
            """ 
            INSERT OVERWRITE TABLE `spark_input_test4` PARTITION
(dt='20200917')
            SELECT id, cur_inst_id, status, update_time, schedule_time,
task_name
            FROM temp_table where schedule_time > '2020-09-01 00:00:00' 
            """]

def multi_thread():
    sql = """SELECT id, plan_id, status, retry_times, start_time,
schedule_datekey, task_name, update_time, schedule_time, cur_inst_id,
scheduler_id
        FROM table
        where dt < '20200801'"""
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    threads = []
    for i in range(2):
        try:
            t = threading.Thread(target=insert_overwrite_thread,
args=(sqls[i], data, ))
            t.start()
            threads.append(t)
        except Exception as x:
            print x
    for t in threads:
        t.join()

def insert_overwrite_thread(sql, data):
    data.createOrReplaceTempView('temp_table')
    spark.sql(sql)



Since Spark is lazy, persist() by itself does not trigger any computation, so
the first RDD is still computed multiple times when the jobs are submitted in
parallel.
I would like to ask whether there are other ways to avoid this, thanks!
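
One thing I am wondering about is whether running a single action such as
count() right after persist() would be enough to materialize the cache before
the threads start, so that both INSERT statements read the cached data instead
of recomputing the source query. A rough sketch of what I mean, using the same
names as in the code above:

    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    # run one action so the result is computed and cached once, up front
    data.count()
    data.createOrReplaceTempView('temp_table')
    # ... then start the two INSERT threads as above; they should now read
    # from the cache instead of re-running the source query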

Cheers,
Gang Li



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org


Re: Let multiple jobs share one rdd?

Posted by Khalid Mammadov <kh...@gmail.com>.
Perhaps you can use Global Temp Views?

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrame.createGlobalTempView
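
For example, a rough sketch based on your code (not tested):

    # register the cached result once as a global temp view
    data = spark.sql(sql).persist(StorageLevel.MEMORY_AND_DISK)
    data.createGlobalTempView('temp_table')

    # Global temp views live in the system database global_temp, so the
    # INSERT statements would read FROM global_temp.temp_table instead of
    # FROM temp_table. The view is visible to every SparkSession in the
    # same application, so each thread could run against its own
    # spark.newSession() and still see it.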


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org