Posted to user@spark.apache.org by Joris Billen <jo...@bigindustries.be> on 2023/01/31 15:35:35 UTC

[Spark/sparklyr] Why is Spark caching tables read through a JDBC connection from Oracle, even when memory = FALSE is chosen?

This question is about using Spark with sparklyr.
We load a lot of data from Oracle into Spark DataFrames through a JDBC connection:

dfX <- spark_read_jdbc(spConn, "myconnection",
            options = list(
                    url = urlDEVdb,
                    driver = "oracle.jdbc.OracleDriver",
                    user = dbt_schema,
                    password = dbt_password,
                    dbtable = pQuery,
                    memory = FALSE # don't cache the whole (big) table
            ))
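
For reference, the sparklyr documentation lists memory as an argument of spark_read_jdbc itself (alongside repartition and overwrite) rather than as an entry in the options list, so the call may need to look more like the sketch below. I am not sure whether this placement is what matters here, which is part of why I am asking:

dfX <- spark_read_jdbc(spConn, "myconnection",
            options = list(
                    url = urlDEVdb,
                    driver = "oracle.jdbc.OracleDriver",
                    user = dbt_schema,
                    password = dbt_password,
                    dbtable = pQuery
            ),
            memory = FALSE) # memory as a top-level argument, not a JDBC option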

Then we run a lot of SQL statements and use sdf_register to register the intermediate results. Eventually we want to write the final result to a database.
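
To make the shape of the pipeline concrete, here is a simplified sketch of the kind of thing we do (table and column names are made up, and sdf_sql / spark_write_jdbc stand in for the actual steps):

# run a SQL step on the registered JDBC table and register the result
dfY <- sdf_sql(spConn, "SELECT customer_id, SUM(amount) AS total
                        FROM myconnection GROUP BY customer_id")
sdf_register(dfY, "myaggregate")

# eventually write the final result back to the database
spark_write_jdbc(dfY, "TARGET_TABLE",
                 mode = "overwrite",
                 options = list(
                         url = urlDEVdb,
                         driver = "oracle.jdbc.OracleDriver",
                         user = dbt_schema,
                         password = dbt_password
                 ))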

Although we have set memory = FALSE, we see that all these tables get cached. I notice that counts are triggered (I think this happens just before a table is cached) and a collect is triggered. It also looks like registering tables with sdf_register triggers a collect action (almost as if those results are cached as well). This leads to a lot of actions (often on dataframes resulting from the same pipeline), which takes a long time.
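
If it helps to narrow things down: the only explicit cache control I know of in sparklyr is tbl_cache() / tbl_uncache(), e.g. something like the following to drop a registered table from the cache again (sketch only, reusing the names from above):

tbl_uncache(spConn, "myconnection")   # drop the JDBC table from the cache
tbl_uncache(spConn, "myaggregate")    # drop an intermediate registered result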

Questions for people using sparklyr + Spark:
1) Is it possible that memory = FALSE is ignored when reading through JDBC?
2) Can someone confirm that a lot of automatic caching is happening (and hence a lot of counts and a lot of actions)?


Thanks for any input!


---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org