Posted to user@spark.apache.org by Saulo Ricci <in...@gmail.com> on 2017/05/01 14:50:49 UTC

Reading table from sql database to apache spark dataframe/RDD

Hi,


I have the following code that reads a table into an Apache Spark
DataFrame:

 val df = spark.read.format("jdbc")
     .option("url", "jdbc:postgresql:host/database")
     .option("dbtable", "tablename")
     .option("user", "username")
     .option("password", "password")
     .load()

When I first invoke df.count() I get a smaller number than when I invoke
the same count method again.

Why does this happen?

Doesn't Spark load a snapshot of my table into a DataFrame on my Spark
cluster when I first read that table?

My table on Postgres keeps being fed, and my DataFrame seems to be
reflecting that.

How can I load just a static snapshot of my table into Spark's
DataFrame, as of the time the `read` method is invoked?


Any help is appreciated,

-- 
Saulo

Re: Reading table from sql database to apache spark dataframe/RDD

Posted by vincent gromakowski <vi...@gmail.com>.
Use cache() or persist(). The DataFrame will be materialized when the first
action is called, and then reused from memory for each subsequent use.
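A minimal sketch of that approach, reusing the JDBC options from the question (the URL, table name, and credentials are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("jdbc-snapshot")
  .getOrCreate()

val df = spark.read.format("jdbc")
  .option("url", "jdbc:postgresql:host/database")
  .option("dbtable", "tablename")
  .option("user", "username")
  .option("password", "password")
  .load()
  .cache()              // mark the DataFrame for caching; nothing is read yet

val n = df.count()      // first action: reads from Postgres and fills the cache
val m = df.count()      // served from the cached copy, so m == n even if the table grows
```

One caveat: cached partitions evicted under memory pressure can be recomputed from the source, which would re-read the live table. Using persist(StorageLevel.MEMORY_AND_DISK) instead of cache() makes eviction spill to local disk rather than fall back to Postgres.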
