Posted to user@spark.apache.org by Roger Marin <ro...@rogersmarin.com> on 2016/02/29 11:42:35 UTC

What is the best approach to perform concurrent updates from different jobs to an in-memory dataframe registered as a temp table?

Hi all,

I have multiple (>100) jobs running concurrently (sharing the same Hive
context), each appending new rows to the same dataframe registered as a
temp table.

Currently I am using unionAll and registering that dataframe again as a
temp table in each job:

Given an existing dataframe registered as the temp table "test":

// Create a dataframe with the new rows to append
val newRows = hiveContext.createDataFrame(rows, schema)

// Retrieve the existing dataframe and append the new rows via unionAll
val updatedDF = hiveContext.table("test").unionAll(newRows)

// Uncache the existing dataframe
hiveContext.uncacheTable("test")

// Register the updated dataframe as a temp table
updatedDF.registerTempTable("test")

// Cache the updated dataframe
hiveContext.table("test").cache()

I am finding that this approach depletes memory very quickly, since each
call to ".cache()" in each of the jobs creates a new in-memory entry for
the same dataframe.

Does anyone know of a more efficient solution to the above?

Thanks,
Roger