Posted to dev@spark.apache.org by Andy Huang <an...@servian.com.au> on 2015/09/23 07:03:39 UTC

Fwd: Parallel collection in driver programs

Hi Devs,

Hopefully one of you knows more about this?

Thanks

Andy
---------- Forwarded message ----------
From: Andy Huang <an...@servian.com.au>
Date: Wed, Sep 23, 2015 at 12:39 PM
Subject: Parallel collection in driver programs
To: user@spark.apache.org


Hi All,

I'd like to know if anyone has experience with parallel collections in the
driver program, and whether there is any real advantage or disadvantage to
doing so.

E.g. with a collection of JDBC connections and tables:

We have adapted our non-Spark code, which uses a parallel collection, to
Spark, and it seems to work fine:

import java.util.Properties

// (output name, "table::partitionColumn::lowerBound::upperBound::numPartitions")
val conf = List(
  ("tbl1", "dbo.tbl1::tb1_id::0::127::128"),
  ("tbl2", "dbo.tbl2::tb2_id::0::31::32"),
  ("tbl3", "dbo.tbl3::tb3_id::0::63::64")
)

val _JDBC_DEFAULT = "jdbc:sqlserver://192.168.52.1;database=TestSource"
val _STORE_DEFAULT = "hdfs://192.168.52.132:9000/"

val prop = new Properties()
prop.setProperty("user", "sa")
prop.setProperty("password", "password")

conf.par.foreach { case (name, spec) =>

  // split the spec once instead of once per field
  val Array(qry, pCol, lo, hi, part) = spec.split("::")

  // create a DataFrame from the JDBC table
  val jdbcDF = sqlContext.read.jdbc(
    _JDBC_DEFAULT,
    "(" + qry + ") a",
    pCol,        // partition column
    lo.toLong,   // lower bound
    hi.toLong,   // upper bound
    part.toInt,  // number of partitions
    prop         // java.util.Properties - key/value pairs
  )

  // save to Parquet
  jdbcDF.write.mode("overwrite").parquet(_STORE_DEFAULT + name + ".parquet")
}
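
One related detail: by default, .par runs on a shared ForkJoinPool sized to
the number of cores on the driver, so all of the exports above may be
submitted to Spark at once. If that needs throttling, the parallelism can be
capped by swapping in a custom task support. A minimal sketch (assuming the
Scala 2.10/2.11 parallel collections API; the pool size of 2 is just an
example, not something we have benchmarked):

import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

// cap driver-side parallelism at 2 concurrent table exports
val parConf = conf.par
parConf.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(2))

parConf.foreach { case (name, spec) =>
  // ... same JDBC read / Parquet write as above ...
}

The concurrent jobs then share the cluster through Spark's scheduler, which
is FIFO by default (spark.scheduler.mode can be set to FAIR if the jobs
should share executors more evenly).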


Thanks.
-- 
Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
f: 02 9376 0730 | m: 0433221979



-- 
Andy Huang | Managing Consultant | Servian Pty Ltd | t: 02 9376 0700 |
f: 02 9376 0730 | m: 0433221979