Posted to users@kafka.apache.org by Dominik Safaric <do...@gmail.com> on 2016/08/25 16:13:19 UTC

Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

A few days ago, for the purpose of benchmarking stream processing engines, I started migrating an entire collection from MongoDB to a Kafka log.

In summary, the MongoDB collection contains approximately 560 million documents with a mean size of 2529 bytes, which I am, at the time of writing, still migrating to Kafka - 3 days and counting.

The configuration of the instance I run the migration on is as follows: 16 x 2.6 GHz CPU cores, 64 GB of RAM, and 4.6 TB of hard drive.

The Python script I've written is as follows:

import argparse

# json_util.dumps, unlike the standard library's json.dumps, can serialize
# BSON types such as ObjectId and datetime found in MongoDB documents
from bson import CodecOptions
from bson.json_util import dumps
from kafka import KafkaProducer
from pymongo import MongoClient

class Arguments:
  def __init__(self, arguments):
    self.arguments = arguments
  @property
  def mongodb_host(self):
    return self.arguments['mongodb.host']
  @property
  def mongodb_database(self):
    return self.arguments['mongodb.database']
  @property
  def mongodb_collection(self):
    return self.arguments['mongodb.collection']
  @property
  def kafka_bootstrap_servers(self):
    return self.arguments['kafka.bootstrap.servers']
  @property
  def kafka_topic(self):
    return self.arguments['kafka.topic']

argument_parser = argparse.ArgumentParser(description='MongoDB collection to Apache Kafka topic migration script')

argument_parser.add_argument('-m', '--mongodb.host', default='mongodb://localhost:27017', help='MongoDB hostname:port')
argument_parser.add_argument('-d', '--mongodb.database', required=True, help="MongoDB database")
argument_parser.add_argument('-c', '--mongodb.collection', required=True, help='MongoDB source collection')
argument_parser.add_argument('-k', '--kafka.bootstrap.servers', default='localhost:9092', help='Apache Kafka bootstrap server')
argument_parser.add_argument('-t', '--kafka.topic', required=True, help='Apache Kafka target topic')

arguments = Arguments(vars(argument_parser.parse_args()))

mongodb_client = MongoClient(arguments.mongodb_host)
mongodb_database = mongodb_client[arguments.mongodb_database]

kafka_producer = KafkaProducer(
  bootstrap_servers=[arguments.kafka_bootstrap_servers],
  value_serializer=lambda document: dumps(document).encode('ascii'),
  acks='all',
  retries=3)

for document in mongodb_database[arguments.mongodb_collection].with_options(
    codec_options=CodecOptions(unicode_decode_error_handler='ignore')).find():
  kafka_producer.send(arguments.kafka_topic, document)

kafka_producer.flush()

The Kafka configuration I use is, in summary: replication factor 1, 6 log partitions, 8 I/O threads, and 3 network threads.
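For what it's worth, a few kafka-python producer settings tend to matter far more for bulk-load throughput than broker thread counts. The sketch below is illustrative only; every numeric value is an assumption to experiment with, not a measured optimum:

```python
# Illustrative kafka-python producer settings for a bulk load; all numeric
# values here are assumptions to tune against your own measurements.
producer_config = {
    "bootstrap_servers": ["localhost:9092"],
    "acks": 1,                         # 'all' with replication factor 1 buys
                                       # little extra safety but adds latency
    "linger_ms": 50,                   # wait up to 50 ms so sends batch up
    "batch_size": 256 * 1024,          # the 16 KB default holds only ~6
                                       # documents of mean size 2529 bytes
    "compression_type": "gzip",        # trade CPU for network/disk throughput
    "buffer_memory": 64 * 1024 * 1024,
}

# With a broker running, the dict unpacks straight into the client:
#   from kafka import KafkaProducer
#   producer = KafkaProducer(**producer_config)
```

Larger batches plus a small linger amortize per-request overhead, which is usually the first win on a single-producer bulk load.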

Based on this information I have the following questions:

How can I improve the overall performance? I am aware that part of the overhead might come from the MongoDB module I use.
Can I improve the performance by increasing the number of log partitions? In addition, what is the “secret” behind choosing an optimal number of log partitions?
Can I improve the performance by increasing the number of I/O threads, considering my hardware configuration?
If I increase e.g. the number of log partitions to raise throughput, will message consumption performance decrease?

If I missed any relevant information, please do not hesitate to ask.

Thanks a lot for your help!

Kind regards,
Dominik Safaric 

—
Dominik Safaric | Software Engineer
+385 91 606 9504 | dominik.safaric@media-soft.info
Media-Soft | http://www.media-soft.info 


Re: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Posted by Sharninder <sh...@gmail.com>.
I think what Dana is suggesting is that since Python isn't doing a good job
of utilising all the available CPU power, you could run multiple Python
processes to share the load: divide the MongoDB collection into, say,
4 parts and process each part with one Python process, all producing to the
same topic on the Kafka side.

Or use a multi-threaded Java producer that is able to use the machine
optimally.
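A minimal sketch of that suggestion follows. The helper and worker names are hypothetical, and skip/limit is only the simplest way to split a collection; splitting on an indexed _id range would scale better for 560 million documents:

```python
from multiprocessing import Process

def split_ranges(total, parts):
    """Split `total` documents into `parts` contiguous (skip, limit) ranges."""
    base, remainder = divmod(total, parts)
    ranges, skip = [], 0
    for i in range(parts):
        limit = base + (1 if i < remainder else 0)
        ranges.append((skip, limit))
        skip += limit
    return ranges

def migrate_range(skip, limit):
    # Hypothetical worker: each process opens its own MongoClient and
    # KafkaProducer (neither is safe to share across a fork) and streams
    # collection.find().skip(skip).limit(limit) to the target topic.
    pass

if __name__ == "__main__":
    # One process per range; each gets its own GIL and its own CPU core.
    for skip, limit in split_ranges(1000, 4):
        Process(target=migrate_range, args=(skip, limit)).start()
```

Each process carries its own interpreter, so serialization work that would contend on one GIL runs genuinely in parallel.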


On Thu, Aug 25, 2016 at 10:21 PM, Dominik Safaric <do...@gmail.com>
wrote:

> Dear Dana,
>
> > I would recommend
> > other tools for bulk transfers.
>
>
> What tools/languages would you rather recommend than using Python?
>
> I could for sure accomplish the same by using the native Java Kafka
> Producer API, but should this really affect the performance under the
> assumption that the Kafka configuration stays as is?
>
> > On 25 Aug 2016, at 18:43, Dana Powers <da...@gmail.com> wrote:
> >
> > python is generally restricted to a single CPU, and kafka-python will max
> > out a single CPU well before it maxes a network card. I would recommend
> > other tools for bulk transfers. Otherwise you may find that partitioning
> > your data set and running separate python processes for each will
> > increase the overall CPU available and therefore the throughput.
> >
> > One day I will spend time improving the CPU performance of kafka-python,
> > but probably not in the near term.
> >
> > -Dana
>
>


-- 
Sharninder

Re: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Posted by Dana Powers <da...@gmail.com>.
kafka-python includes some benchmarking scripts in
https://github.com/dpkp/kafka-python/tree/master/benchmarks

The concurrency and execution model of the JVM are both significantly
different from Python's. I would definitely recommend some background reading
on the CPython GIL if you are interested in why Python threads are restricted
to a single CPU.
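A minimal illustration of that point: CPU-bound Python bytecode does not run in parallel across threads, so parallelism for a CPU-bound producer has to come from processes. The workload below is a stand-in for per-message serialization cost:

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def cpu_bound(n):
    # Stand-in for per-message serialization work; pure Python bytecode,
    # so it holds the GIL for its entire duration.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [200_000] * 4
    # Threads: results are correct, but the GIL serializes the bytecode,
    # so the four tasks effectively run one at a time.
    with ThreadPoolExecutor(max_workers=4) as pool:
        threaded = list(pool.map(cpu_bound, inputs))
    # Processes: one interpreter and one GIL per process, so the same
    # work can occupy four cores at once.
    with ProcessPoolExecutor(max_workers=4) as pool:
        parallel = list(pool.map(cpu_bound, inputs))
    assert threaded == parallel
```

The results are identical either way; only the wall-clock time differs, which is exactly the effect described above.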

-Dana

On Thu, Aug 25, 2016 at 9:53 AM, Tauzell, Dave <Da...@surescripts.com>
wrote:
> I would write a python client that writes dummy data to kafka to measure
> how fast you can write to Kafka without MongoDB in the mix. I've been
> doing load testing recently and with 3 brokers I can write 100MB/s (using
> Java clients).
>
> -Dave
>
> -----Original Message-----
> From: Dominik Safaric [mailto:dominiksafaric@gmail.com]
> Sent: Thursday, August 25, 2016 11:51 AM
> To: users@kafka.apache.org
> Subject: Re: Kafka Producer performance - 400GB of transfer on single
> instance taking > 72 hours?
>
> Dear Dana,
>
>> I would recommend
>> other tools for bulk transfers.
>
>
> What tools/languages would you rather recommend than using Python?
>
> I could for sure accomplish the same by using the native Java Kafka
> Producer API, but should this really affect the performance under the
> assumption that the Kafka configuration stays as is?
>
>> On 25 Aug 2016, at 18:43, Dana Powers <da...@gmail.com> wrote:
>>
>> python is generally restricted to a single CPU, and kafka-python will
>> max out a single CPU well before it maxes a network card. I would
>> recommend other tools for bulk transfers. Otherwise you may find that
>> partitioning your data set and running separate python processes for
>> each will increase the overall CPU available and therefore the
>> throughput.
>>
>> One day I will spend time improving the CPU performance of
>> kafka-python, but probably not in the near term.
>>
>> -Dana
>
> This e-mail and any files transmitted with it are confidential, may
> contain sensitive information, and are intended solely for the use of the
> individual or entity to whom they are addressed. If you have received this
> e-mail in error, please notify the sender by reply e-mail immediately and
> destroy all copies of the e-mail and any attachments.

RE: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Posted by "Tauzell, Dave" <Da...@surescripts.com>.
I would write a python client that writes dummy data to kafka to measure how fast you can write to Kafka without MongoDB in the mix. I've been doing load testing recently and with 3 brokers I can write 100MB/s (using Java clients).

-Dave
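That dummy-data test could be sketched as follows. The `measure_throughput` helper is hypothetical; the timing loop is the point, and any fixed-size payload works:

```python
import time

def measure_throughput(send, payload, count):
    """Send `payload` `count` times via `send`; return (seconds, MB/s)."""
    start = time.perf_counter()
    for _ in range(count):
        send(payload)
    elapsed = time.perf_counter() - start
    megabytes = len(payload) * count / 1_000_000
    return elapsed, megabytes / elapsed if elapsed > 0 else float("inf")

# Against a real broker, `send` would wrap a producer, e.g.
#   producer = KafkaProducer(bootstrap_servers=["localhost:9092"])
#   send = lambda p: producer.send("throughput-test", p)
# with producer.flush() called before reading the clock, so that queued
# messages are actually on the wire when the timer stops.
payload = b"x" * 2529  # the mean document size from the original post
elapsed, rate = measure_throughput(lambda p: None, payload, 10_000)
```

Comparing that rate against the MongoDB-fed run shows immediately whether the bottleneck is the cursor or the producer.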

-----Original Message-----
From: Dominik Safaric [mailto:dominiksafaric@gmail.com]
Sent: Thursday, August 25, 2016 11:51 AM
To: users@kafka.apache.org
Subject: Re: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Dear Dana,

> I would recommend
> other tools for bulk transfers.


What tools/languages would you rather recommend than using Python?

I could for sure accomplish the same by using the native Java Kafka Producer API, but should this really affect the performance under the assumption that the Kafka configuration stays as is?

> On 25 Aug 2016, at 18:43, Dana Powers <da...@gmail.com> wrote:
>
> python is generally restricted to a single CPU, and kafka-python will
> max out a single CPU well before it maxes a network card. I would
> recommend other tools for bulk transfers. Otherwise you may find that
> partitioning your data set and running separate python processes for
> each will increase the overall CPU available and therefore the throughput.
>
> One day I will spend time improving the CPU performance of
> kafka-python, but probably not in the near term.
>
> -Dana


Re: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Posted by Dominik Safaric <do...@gmail.com>.
Dear Dana,

> I would recommend
> other tools for bulk transfers.


What tools/languages would you rather recommend than using Python?

I could for sure accomplish the same by using the native Java Kafka Producer API, but should this really affect the performance under the assumption that the Kafka configuration stays as is?  

> On 25 Aug 2016, at 18:43, Dana Powers <da...@gmail.com> wrote:
> 
> python is generally restricted to a single CPU, and kafka-python will max
> out a single CPU well before it maxes a network card. I would recommend
> other tools for bulk transfers. Otherwise you may find that partitioning
> your data set and running separate python processes for each will increase
> the overall CPU available and therefore the throughput.
> 
> One day I will spend time improving the CPU performance of kafka-python,
> but probably not in the near term.
> 
> -Dana


Re: Kafka Producer performance - 400GB of transfer on single instance taking > 72 hours?

Posted by Dana Powers <da...@gmail.com>.
python is generally restricted to a single CPU, and kafka-python will max
out a single CPU well before it maxes a network card. I would recommend
other tools for bulk transfers. Otherwise you may find that partitioning
your data set and running separate python processes for each will increase
the overall CPU available and therefore the throughput.

One day I will spend time improving the CPU performance of kafka-python,
but probably not in the near term.

-Dana