Posted to user@spark.apache.org by Ahmed Nawar <ah...@gmail.com> on 2015/08/27 20:42:43 UTC
Commit DB Transaction for each partition
Thanks for the foreach idea. But once I used it I got an empty RDD; I think
that's because "results" is an iterator.
Yes, I know map is lazy, but I expected there to be a way to force the action.
I cannot use foreachPartition because I need to reuse the new RDD after some
maps.
On Thu, Aug 27, 2015 at 5:11 PM, Cody Koeninger <co...@koeninger.org> wrote:
>
> Map is lazy. You need an actual action, or nothing will happen. Use
> foreachPartition, or do an empty foreach after the map.
>
> On Thu, Aug 27, 2015 at 8:53 AM, Ahmed Nawar <ah...@gmail.com>
> wrote:
>
>> Dears,
>>
>> I need to commit a DB transaction for each partition, not for each row.
>> The code below didn't work for me.
>>
>>
>> rdd.mapPartitions(partitionOfRecords => {
>>   DBConnectionInit()
>>   val results = partitionOfRecords.map(......)
>>   DBConnection.commit()
>> })
>>
>>
>>
>> Best regards,
>>
>> Ahmed Atef Nawwar
>>
>> Data Management & Big Data Consultant
>>
>>
>>
>>
>>
>>
>> On Thu, Aug 27, 2015 at 4:16 PM, Cody Koeninger <co...@koeninger.org>
>> wrote:
>>
>>> Your kafka broker died or you otherwise had a rebalance.
>>>
>>> Normally spark retries take care of that.
>>>
>>> Is there something going on with your Kafka installation such that the
>>> rebalance is taking especially long?
>>>
>>> Yes, increasing backoff / max number of retries will "help", but it's
>>> better to figure out what's going on with kafka.
>>>
>>> On Wed, Aug 26, 2015 at 9:07 PM, Shushant Arora <
>>> shushantarora09@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> My streaming application gets killed with below error
>>>>
>>>> 15/08/26 21:55:20 ERROR kafka.DirectKafkaInputDStream:
>>>> ArrayBuffer(kafka.common.NotLeaderForPartitionException,
>>>> kafka.common.NotLeaderForPartitionException,
>>>> kafka.common.NotLeaderForPartitionException,
>>>> kafka.common.NotLeaderForPartitionException,
>>>> kafka.common.NotLeaderForPartitionException,
>>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>>> Set([testtopic,223], [testtopic,205], [testtopic,64], [testtopic,100],
>>>> [testtopic,193]))
>>>> 15/08/26 21:55:20 ERROR scheduler.JobScheduler: Error generating jobs
>>>> for time 1440626120000 ms
>>>> org.apache.spark.SparkException:
>>>> ArrayBuffer(kafka.common.NotLeaderForPartitionException,
>>>> org.apache.spark.SparkException: Couldn't find leader offsets for
>>>> Set([testtopic,115]))
>>>> at
>>>> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.latestLeaderOffsets(DirectKafkaInputDStream.scala:94)
>>>> at
>>>> org.apache.spark.streaming.kafka.DirectKafkaInputDStream.compute(DirectKafkaInputDStream.scala:116)
>>>> at
>>>> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
>>>> at
>>>> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:300)
>>>> at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
>>>> at
>>>> org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1.apply(DStream.scala:299)
>>>> at
>>>>
>>>>
>>>>
>>>> The Kafka params printed in the job logs are:
>>>> value.serializer = class
>>>> org.apache.kafka.common.serialization.StringSerializer
>>>> key.serializer = class
>>>> org.apache.kafka.common.serialization.StringSerializer
>>>> block.on.buffer.full = true
>>>> retry.backoff.ms = 100
>>>> buffer.memory = 1048576
>>>> batch.size = 16384
>>>> metrics.sample.window.ms = 30000
>>>> metadata.max.age.ms = 300000
>>>> receive.buffer.bytes = 32768
>>>> timeout.ms = 30000
>>>> max.in.flight.requests.per.connection = 5
>>>> bootstrap.servers = [broker1:9092, broker2:9092, broker3:9092]
>>>> metric.reporters = []
>>>> client.id =
>>>> compression.type = none
>>>> retries = 0
>>>> max.request.size = 1048576
>>>> send.buffer.bytes = 131072
>>>> acks = all
>>>> reconnect.backoff.ms = 10
>>>> linger.ms = 0
>>>> metrics.num.samples = 2
>>>> metadata.fetch.timeout.ms = 60000
>>>>
>>>>
>>>> Is the Kafka broker going down and killing the job? What's the best way
>>>> to handle it?
>>>> Will increasing retries and backoff time help, and what values should they
>>>> be set to so the streaming application never fails outright, but instead
>>>> keeps retrying every few seconds and emits an event so that my custom code
>>>> can send a notification if a Kafka broker being down is the cause?
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>
>>
>
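Concretely, the backoff/retry knobs discussed in the quoted Kafka thread can be raised along these lines. This is a hedged sketch: the key names are taken from the Kafka 0.8 consumer and Spark 1.x streaming-kafka documentation and should be verified against your versions; the values are illustrative, not recommendations.

```scala
// Hedged sketch: raising retry/backoff settings for the direct Kafka stream.
// Verify these keys against your Kafka/Spark versions before relying on them.
val kafkaParams = Map(
  "bootstrap.servers"         -> "broker1:9092,broker2:9092,broker3:9092",
  "retry.backoff.ms"          -> "1000", // was 100 in the logs above
  "refresh.leader.backoff.ms" -> "2000"  // wait longer for a new leader after a rebalance
)

// On the Spark side, the direct stream's own retry count can be raised
// (spark.streaming.kafka.maxRetries defaults to 1).
val sparkSettings = Map(
  "spark.streaming.kafka.maxRetries" -> "5"
)
```

These maps would be passed to KafkaUtils.createDirectStream and SparkConf respectively; better still, as Cody notes, investigate why the rebalance is slow rather than only papering over it with retries.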
Re: Commit DB Transaction for each partition
Posted by Ahmed Nawar <ah...@gmail.com>.
Thanks a lot for your support. It is working now.
I wrote it as below:

val newRDD = rdd.mapPartitions { partition =>
  val result = partition.map(.....)
  result
}

newRDD.foreach { _ => }
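The working version here forces the computation with an empty foreach. An alternative that also solves the original per-partition commit problem is to materialize each partition's results before committing, so the side effects and the commit happen in order and the returned iterator is still intact. A pure-Scala sketch of the pattern, with `grouped` standing in for Spark's partitioning and the thread's DB helpers stubbed out as hypothetical placeholders:

```scala
// Hypothetical stand-ins for the DB helpers named in the thread.
def dbConnectionInit(): Unit = ()
def dbCommit(): Unit = ()

val data = (1 to 10).toList

// grouped(5) plays the role of Spark partitions; inside Spark this body
// would be the closure passed to rdd.mapPartitions.
val processed = data.iterator.grouped(5).flatMap { partition =>
  dbConnectionInit()
  // .toList materializes the results *now*, before the commit,
  // instead of leaving a lazy iterator that a later foreach would exhaust.
  val results = partition.map(_ * 2).toList
  dbCommit()
  results.iterator // the closure must return an iterator
}.toList
```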
On Thu, Aug 27, 2015 at 10:34 PM, Cody Koeninger <co...@koeninger.org> wrote:
> This job contains a Spark output action, and is what I originally meant:
>
> rdd.mapPartitions {
>   result
> }.foreach {
> }
>
> This job is just a transformation, and won't do anything unless you have
> another output action. Not to mention, it will exhaust the iterator, as
> you noticed:
>
> rdd.mapPartitions {
>   result.foreach
>   result
> }
Re: Commit DB Transaction for each partition
Posted by Cody Koeninger <co...@koeninger.org>.
This job contains a Spark output action, and is what I originally meant:

rdd.mapPartitions {
  result
}.foreach {
}

This job is just a transformation, and won't do anything unless you have
another output action. Not to mention, it will exhaust the iterator, as
you noticed:

rdd.mapPartitions {
  result.foreach
  result
}
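The exhaustion described here is a property of plain Scala iterators rather than anything Spark-specific; a self-contained illustration:

```scala
val mapped = Iterator(1, 2, 3).map(_ * 10) // lazy: nothing runs yet

mapped.foreach(_ => ()) // forces the map, but also consumes the iterator
// mapped now has nothing left to hand back to Spark, hence the "empty RDD"

// Materializing first avoids the problem: a List can be traversed repeatedly.
val materialized = Iterator(1, 2, 3).map(_ * 10).toList
materialized.foreach(_ => ()) // safe: traversal does not consume a List
```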
Re: Commit DB Transaction for each partition
Posted by Ahmed Nawar <ah...@gmail.com>.
Yes, of course, I am doing that. But once I added results.foreach(row => {})
I got an empty RDD.

rdd.mapPartitions(partitionOfRecords => {
  DBConnectionInit()
  val results = partitionOfRecords.map(......)
  DBConnection.commit()
  results.foreach(row => {})
  results
})
Re: Commit DB Transaction for each partition
Posted by Cody Koeninger <co...@koeninger.org>.
You need to return an iterator from the closure you provide to mapPartitions
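In other words, the closure handed to mapPartitions must have the shape Iterator[T] => Iterator[U], with an iterator as its last expression. A minimal stand-alone illustration of a closure with that shape (the names here are invented for the example):

```scala
// The function shape rdd.mapPartitions expects: Iterator[T] => Iterator[U].
val perPartition: Iterator[Int] => Iterator[String] = { records =>
  records.map(n => s"row-$n") // the last expression is itself an Iterator[String]
}

val out = perPartition(Iterator(1, 2, 3)).toList
```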