Posted to user@spark.apache.org by Praveen Balaji <pr...@soundhound.com> on 2015/03/18 18:22:42 UTC

Database operations on executor nodes

I was wondering how people generally handle database operations from executor nodes. For now, I’m avoiding database updates from executor nodes so that database connections don't proliferate across the cluster. The general pattern I adopt is to collect queries (or tuples) on the executors and write to the database on the driver.

// Executes on the executor
rdd.foreach(s => {
  val query = s"insert into .... ${s}";
  accumulator += query;
});

// Executes on the driver
accumulator.value.foreach(query => {
    // get connection
    // update database
});

I’m obviously trading database connections for driver heap. How do other Spark users do it?

Cheers
Praveen
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: Database operations on executor nodes

Posted by Akhil Das <ak...@sigmoidanalytics.com>.

It totally depends on your database. If it's a NoSQL database like
MongoDB or HBase, you can use the native .saveAsNewAPIHadoopFile or
.saveAsHadoopDataset methods.

For SQL databases, I think people usually put the overhead on the driver,
like you did.
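
A minimal sketch of that driver-side pattern, in case it helps. It is not
taken from the original post: it assumes Spark 2.0 or later for
sc.collectionAccumulator, and the table name, column name and JDBC URL are
hypothetical, since the actual insert statement is elided in the thread. It
ships plain values (not ready-made query strings) to the driver and writes
them through a single connection as one JDBC batch.

```scala
import java.sql.{Connection, DriverManager}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import scala.collection.JavaConverters._

object DriverSideWrite {
  // Hypothetical table and column; the real statement is not given in the thread.
  val insertSql = "INSERT INTO events (payload) VALUES (?)"

  // Runs on the driver: opens one connection and sends all values as one batch.
  // Returns the number of statements in the executed batch.
  def writeAll(values: Iterable[String], connect: () => Connection): Int = {
    val conn = connect()
    try {
      val stmt = conn.prepareStatement(insertSql)
      try {
        values.foreach { v =>
          stmt.setString(1, v) // bind the value instead of building SQL strings
          stmt.addBatch()
        }
        stmt.executeBatch().length
      } finally stmt.close()
    } finally conn.close()
  }

  def run(sc: SparkContext, rdd: RDD[String], jdbcUrl: String): Int = {
    // Assumption: Spark 2.0+ API; older releases used a different accumulator API.
    val acc = sc.collectionAccumulator[String]("rows")
    rdd.foreach(s => acc.add(s)) // executes on the executors
    // Back on the driver: acc.value is a java.util.List[String] held in driver heap.
    writeAll(acc.value.asScala, () => DriverManager.getConnection(jdbcUrl))
  }
}
```

The trade-off is the one already mentioned in the question: every value ends
up in driver memory before it is written, so this only suits result sets
that fit comfortably in the driver heap.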

Thanks
Best Regards
