Posted to user@spark.apache.org by "A.K.M. Ashrafuzzaman" <as...@gmail.com> on 2015/03/08 07:54:09 UTC

Bulk insert strategy

When processing a DStream, the Spark Programming Guide suggests the following usage of a connection:

dstream.foreachRDD(rdd => {
  rdd.foreachPartition(partitionOfRecords => {
    // ConnectionPool is a static, lazily initialized pool of connections
    val connection = ConnectionPool.getConnection()
    partitionOfRecords.foreach(record => connection.send(record))
    ConnectionPool.returnConnection(connection)  // return to the pool for future reuse
  })
})

In this case both the processing and the insertion are done in the workers, and we don't use a batch insert into the DB. How about this use case: we do the processing (parse the JSON string into an object) in the workers, send those objects back to the master, and then issue a single bulk insert request from there. Is there any benefit to sending records individually through a connection pool versus using a bulk operation in the master?
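A third option, which avoids shipping objects back to the master, is to batch inside each partition. Below is a minimal, Spark-independent sketch of that idea; `BulkConnection` and `bulkInsert` are hypothetical stand-ins (not a real Spark or JDBC API), and inside `rdd.foreachPartition` you would call something like `insertInBatches(partitionOfRecords, connection)` instead of sending records one at a time:

```scala
object BatchingSketch {
  // Stub standing in for a pooled database connection that supports
  // multi-row inserts; it counts round trips so the saving is visible.
  class BulkConnection {
    var roundTrips = 0
    def bulkInsert(batch: Seq[String]): Unit = {
      roundTrips += 1 // one network round trip per batch, not per record
    }
  }

  // Iterator.grouped chunks the partition's iterator lazily, so memory
  // use is bounded by batchSize rather than by the partition size.
  def insertInBatches(records: Iterator[String],
                      conn: BulkConnection,
                      batchSize: Int = 500): Unit =
    records.grouped(batchSize).foreach(batch => conn.bulkInsert(batch))

  def main(args: Array[String]): Unit = {
    val conn = new BulkConnection
    insertInBatches(Iterator.tabulate(1200)(_.toString), conn)
    println(conn.roundTrips) // 1200 records in batches of 500 => 3 round trips
  }
}
```

This keeps the work distributed across the workers (unlike collecting to the master) while still getting the round-trip savings of a bulk operation.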
A.K.M. Ashrafuzzaman
Lead Software Engineer
NewsCred

(M) 880-175-5592433
Twitter | Blog | Facebook

Check out The Academy, your #1 source
for free content marketing resources


Re: Bulk insert strategy

Posted by Ashrafuzzaman <as...@gmail.com>.
Yes, and that brings me to another question: how do I do a batch insert from
the workers?
In prod we are planning to use a 3-shard Kinesis stream, so the number of
partitions should be 3. Right?
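For context, the Spark Streaming Kinesis integration guide's pattern is one receiver per shard, with the resulting streams unioned. A rough sketch along those lines is below; the stream name, endpoint, and intervals are placeholders, and this needs a running `StreamingContext` (`ssc`) plus the `spark-streaming-kinesis-asl` artifact, so it is illustrative only:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.Duration
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One receiver per Kinesis shard, then union them into a single DStream.
val numShards = 3
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "myStream",
    "https://kinesis.us-east-1.amazonaws.com",
    Duration(2000), InitialPositionInStream.LATEST,
    StorageLevel.MEMORY_AND_DISK_2)
}
val unionedStream = ssc.union(kinesisStreams)
```

Note that the receiver count sets how streams are ingested; the number of partitions per batch also depends on the block interval, so "3 shards => 3 partitions" is a reasonable starting assumption rather than a guarantee.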
On Mar 8, 2015 8:57 PM, "Ted Yu" <yu...@gmail.com> wrote:

> What's the expected number of partitions in your use case ?
>
> Have you thought of doing batching in the workers ?
>
> Cheers
>
>

Re: Bulk insert strategy

Posted by Ted Yu <yu...@gmail.com>.
What's the expected number of partitions in your use case ?

Have you thought of doing batching in the workers ?

Cheers
