Posted to user@livy.apache.org by Wandong Wu <wu...@husky.neu.edu> on 2018/07/11 09:46:46 UTC

Some questions about cached data in Livy

Dear Sir or Madam:
 
      I am a Livy beginner. I use Livy because, within an interactive session, different Spark jobs can share cached RDDs or DataFrames.
 
      When I read some parquet files and create a table called “TmpTable”, the following queries will use this table. Does that mean the table has been cached?
      If so, where is the table cached: in Livy or in the Spark cluster?
 
      Spark also supports a cache function. When I read some parquet files and create a table called “TmpTable2”, I add this code: sql_ctx.cacheTable('tmpTable2').
      The next query using this table will cache it in the Spark cluster, and the following queries can then use the cached table.
 
      What is the difference between caching in Livy and caching in the Spark cluster?
 
Thanks!
 
Yours
Wandong
 

Re: Some questions about cached data in Livy

Posted by Wandong Wu <wu...@husky.neu.edu>.
Hi:

	I’m sorry, but I’m still confused about Livy's shared object mechanism. In our product design, the shared object is a DataFrame or table created from some parquet files in AWS S3. When I use the Python programmatic API to submit a job, it looks like this:
(1) I don’t cache df or tmpTable in Spark memory.
def job1(context):
    spark = context.spark_session
    df = spark.read.parquet('s3://stg-elasticlog-backup-ex4/type=facet_bandwidth_total/ts_hour=41784*/*/*.parquet') \
             .select('in_byte', 'out_byte', 'user_name')
    # registerTempTable returns None, so register the table in a separate step
    # instead of assigning its result to df
    df.registerTempTable('tmpTable')


(2) Then I reuse “tmpTable” in some SQL queries, like:
def job2(context):
    spark = context.spark_session
    tmpRes3 = spark.sql("select sum(in_byte) as sum_in, user_name from tmpTable \
                           group by user_name \
                           having sum_in > 1000000000 \
                           order by sum_in").toJSON().collect()
    return tmpRes3

(3) Then I run another query:
def job3(context):
    session = context.spark_session
    tmpRes3 = session.sql("select sum(out_byte) as sum_out, user_name from tmpTable \
                           group by user_name \
                           having sum_out > 1000000000 \
                           order by sum_out").toJSON().collect()
    return tmpRes3
My understanding is: for (1), Spark will resolve all the parquet file paths, create a DataFrame, and register the table, but nothing will be cached in Spark memory. For (2) and (3), every time we use “tmpTable” Spark knows what “tmpTable” is, but it will re-read all the files to rebuild the table. Is that right?


Regarding Livy's shared object mechanism: you said “user could store this object Foo into JobContext with a provided name”. Does that store the data itself in memory, or just a reference to the data? Maybe my question confuses you…

In our previous design, we used AWS S3 and Athena. For one report, about 20 SQL queries use the same files in S3, but every query must reload the files, so it is very slow. We want to speed up these queries.

Because these queries use the same files, we actually only need to load them once. Can Livy's shared object mechanism help us, or do we need to use Spark's cache mechanism?
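The "load once, query 20 times" idea can be sketched in plain Python with a memoized loader. This is a stand-in for Spark's cache mechanism (e.g. df.cache() or sql_ctx.cacheTable), not real Spark code; load_table and the sample data are illustrative.

```python
# Memoize the expensive load so repeated queries hit the cached copy,
# analogous to caching a table in Spark before running many queries.
from functools import lru_cache

calls = {"load": 0}

@lru_cache(maxsize=1)
def load_table():
    # Stand-in for the one-time S3/parquet read; after the first call
    # the result is served from the cache.
    calls["load"] += 1
    return (("alice", 7), ("bob", 3))

# 20 "queries" over the same table trigger exactly one load.
results = [sum(v for _, v in load_table()) for _ in range(20)]
print(calls["load"], results[0])  # → 1 10
```

The same shape applies in Spark: cache the DataFrame/table once, then run all 20 report queries against the cached copy instead of re-reading S3 each time.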


Thank you so much!


Yours
Wandong


> On Jul 11, 2018, at 20:17, Saisai Shao <sa...@gmail.com> wrote:
> 
> Hi Wandong,
> 
> Livy's shared object mechanism is mainly used to share objects between different Livy jobs; this is mainly used with the Job API. For example, if job A creates an object Foo that should be accessed by Job B, the user could store this object Foo into JobContext with a provided name; after that, Job B could get this object by the name.
> 
> This is different from Spark's cache mechanism. What you mentioned above (the temp table) is a Spark-provided table cache mechanism, which is unrelated to Livy.


Re: Some questions about cached data in Livy

Posted by Saisai Shao <sa...@gmail.com>.
Hi Wandong,

Livy's shared object mechanism is mainly used to share objects between
different Livy jobs; this is mainly used with the Job API. For example, if
job A creates an object Foo that should be accessed by Job B, the user could
store this object Foo into JobContext with a provided name; after that, Job
B could get this object by the name.

This is different from Spark's cache mechanism. What you mentioned above
(the temp table) is a Spark-provided table cache mechanism, which is
unrelated to Livy.
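To make the store-by-name pattern concrete, here is a minimal plain-Python stand-in. It is not Livy code: FakeJobContext and its method names are only modeled loosely on the shared-object idea (the Java JobContext exposes setSharedObject/getSharedObject-style calls), and the real session wiring is omitted.

```python
# Illustrative stand-in for Livy's shared-object mechanism: Job A publishes
# an object under a name, and Job B later retrieves it by that name.
class FakeJobContext:
    def __init__(self):
        self._shared = {}

    def set_shared_object(self, name, obj):
        # Job A stores the object in the context under a provided name.
        self._shared[name] = obj

    def get_shared_object(self, name):
        # Job B looks the object up by the same name.
        return self._shared[name]

def job_a(context):
    context.set_shared_object("foo", {"answer": 42})

def job_b(context):
    return context.get_shared_object("foo")["answer"]

ctx = FakeJobContext()
job_a(ctx)
print(job_b(ctx))  # → 42
```

Note that what is shared is a reference held inside the session's JVM process; it is not a separate Livy-side data cache.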


