Posted to dev@spark.apache.org by Yang Cao <cy...@gmail.com> on 2016/09/26 15:22:31 UTC

Increase efficiency of working with MongoDB and MySQL databases

Dear all,

I am currently working with Spark 1.6.2, MongoDB and MySQL, and I am stuck on a performance problem. The scenario is: read data from MongoDB into Spark, do some counting work that produces a small result (a few rows), and write that result to a MySQL database. In pseudocode:

val offset = …
val mongoDF = <DataFrame loaded via the Stratio spark-mongodb package (0.11.0)>.filter(<predicate based on offset>)

val resDF = <counting job based on mongoDF>

resDF.write.jdbc(<connection info>)
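
To make the flow concrete, below is a minimal sketch of the job. It is not the production code: the Mongo option keys are how I recall the Stratio data source being configured, and the hosts, column names ("ts", "someKey"), credentials and table name are placeholders.

import java.util.Properties
import org.apache.spark.sql.{SQLContext, SaveMode}

def runJob(sqlContext: SQLContext, offset: Long): Unit = {
  // Load the collection through the Stratio spark-mongodb data source (0.11.0).
  val mongoDF = sqlContext.read
    .format("com.stratio.datasource.mongodb")
    .options(Map(
      "host"       -> "mongo-host:27017",    // placeholder host
      "database"   -> "mydb",                // placeholder database
      "collection" -> "mycollection"))       // placeholder collection
    .load()
    .filter(s"ts >= $offset")                // hypothetical offset column

  // The "counting work": an aggregation that only yields a few rows.
  val resDF = mongoDF.groupBy("someKey").count()

  // Write the small result to MySQL over JDBC.
  val props = new Properties()
  props.setProperty("user", "<user>")        // real credentials omitted
  props.setProperty("password", "<password>")
  resDF.write
    .mode(SaveMode.Append)
    .jdbc("jdbc:mysql://mysql-host:3306/mydb", "result_table", props)
}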

The logic is quite simple, but after several tests I found that loading from MongoDB and saving to MySQL have become the bottleneck of my application.

For the job of reading data from MongoDB, I find it is always split into two stages: the first one is the flatMap at MongodbSchema.scala:41 and the second one is the aggregate at MongodbSchema.scala:47. In my situation, it looks like this:

[Spark UI screenshot showing the two stages; not included in the plain-text archive]
It shows that the first stage gets only one task and one executor, which is extremely slow when working with a collection of billions of rows. Sometimes the first stage takes about an hour while the second one finishes in a few seconds.
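
In case it matters, a quick way to see how many tasks the read would get is to check the partitioning of the loaded DataFrame; a one-line sketch (mongoDF as in the code above):

println(mongoDF.rdd.partitions.length)   // each partition becomes one task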

The JDBC side is similar: the saving process also runs in two stages, one with a single task and the other with 200 tasks, at DataFrameWriter.scala:311.

So my application always gets stuck in the stages that have only one task, even though my cluster has free resources and my MongoDB server also has idle capacity. Can someone explain why these stages get only one task and one executor? Is there any suggestion to speed up these stages?

I have set spark.default.parallelism to 400, but it does not seem to help.
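
For completeness, this is how the setting is applied; a sketch only, and whether it goes through SparkConf as below or through spark-defaults.conf should be equivalent (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Equivalent to putting "spark.default.parallelism 400" in spark-defaults.conf.
val conf = new SparkConf()
  .setAppName("mongo-to-mysql-counts")       // placeholder app name
  .set("spark.default.parallelism", "400")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)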

Any suggestions would be appreciated. Thanks.

Best,

Matthew Cao