Posted to user@spark.apache.org by Roger Marin <ro...@rogersmarin.com> on 2016/02/29 11:42:35 UTC
What is the best approach to perform concurrent updates from
different jobs to an in-memory DataFrame registered as a temp table?
Hi all,
I have many (>100) jobs running concurrently (sharing the same
HiveContext), each appending new rows to the same DataFrame registered
as a temp table.
Currently, in each job, I use unionAll and re-register the result as a
temp table:
Given an existing dataframe registered as the temp table "test":
//Create a DataFrame holding the new rows to append
val newRows = hiveContext.createDataFrame(rows, schema)
//Retrieve the existing DataFrame and append the new rows via unionAll
val updatedDF = hiveContext.table("test").unionAll(newRows)
//Uncache the existing DataFrame
hiveContext.uncacheTable("test")
//Register the updated DataFrame as a temp table
updatedDF.registerTempTable("test")
//Cache the updated DataFrame
hiveContext.table("test").cache()
I am finding that this approach depletes memory very quickly, since
each call to .cache() in each of the jobs creates a new in-memory copy
of the same DataFrame.
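For illustration, one variant would be to serialise the whole
read-modify-write on the shared temp table behind a driver-side lock, so
that at most one cached copy of "test" exists at any time. This is only a
sketch, not something I have running: TestTableLock and appendRows are
made-up names, and it assumes all jobs run in the same driver JVM and
share the lock object.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext

// Hypothetical lock object shared by every job in this driver JVM.
object TestTableLock

// Sketch of a serialised per-job update: uncache the stale copy, swap in
// the appended DataFrame, and cache exactly one fresh copy of "test".
def appendRows(hiveContext: HiveContext, newRows: DataFrame): Unit =
  TestTableLock.synchronized {
    val updated = hiveContext.table("test").unionAll(newRows)
    hiveContext.uncacheTable("test")    // drop the old cached entry
    updated.registerTempTable("test")   // re-register the grown DataFrame
    hiveContext.cacheTable("test")      // cache the single current copy
  }
```

This keeps only one cached entry alive, but it also forces the jobs to
update one at a time, which defeats much of the concurrency.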
Does anyone know of a better solution to the above?
Thanks,
Roger