Posted to user@flume.apache.org by Guillermo Ortiz <ko...@gmail.com> on 2015/12/14 23:52:25 UTC

Kafka Sink, bad distribution of data in the partitions.

I'm using an architecture like this:
Logs --> SpoolDir --> MemChannel --> AvroSink -->
AvroSource --> MemChannel --> KafkaSink.

I have a cluster with three Kafka nodes and have created a topic with six
partitions and a replication factor of one as a POC.

I have seen that 95% of the data goes to two partitions, and those two
partitions are on the same Kafka node. I am not setting a "key" header on
my events in Flume, so, according to the documentation, the key is
generated randomly. The messages are logs from different sources. Is this
behavior normal?
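
For reference, my understanding is that Kafka's keyed partitioning boils
down to hashing the message key and taking it modulo the partition count,
so distinct keys spread roughly evenly, while a missing key leaves the
choice to the producer. A minimal sketch of that idea, assuming a
six-partition topic as in the POC and an illustrative hash (Kafka's default
partitioner uses its own hash function):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class KeyPartitionSketch {

    // Illustrative stand-in for hash(key) mod numPartitions; Kafka's default
    // partitioner uses its own hash, but the spreading idea is the same.
    static int partitionFor(byte[] key, int numPartitions) {
        return (Arrays.hashCode(key) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int partitions = 6; // six-partition topic from the POC
        for (String source : new String[] {"web01", "web02", "db01", "app01"}) {
            byte[] key = source.getBytes(StandardCharsets.UTF_8);
            System.out.printf("key=%s -> partition %d%n",
                    source, partitionFor(key, partitions));
        }
    }
}

Since the KafkaSink takes the message key from the event's "key" header
(which is why I mention not setting one), attaching a varying key, for
example the source hostname added by an interceptor, would be one way to
force a spread like this instead of relying on the keyless path.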

Re: Kafka Sink avro event

Posted by Gonzalo Herreros <gh...@gmail.com>.
Why don't you use an AvroSink and Source to link the tiers? I believe it
will preserve the headers.
You can still use Kafka as the channel if you want its reliability.

Regards,
Gonzalo

On 17 December 2015 at 20:02, Jean <la...@yahoo.fr> wrote:

> Hello,
> I have this configuration:
> Source agent 1 => channel => Kafka sink => Kafka => Kafka source (agent 2)
> => channel => custom sink.
>
> It works, but the Avro headers generated in agent 1 are lost on the second
> Flume agent.
> Is there any way to sink the full Avro event (headers and body) into Kafka
> instead of the body only?
>
> Thx
> Jean
>
>

Kafka Sink avro event

Posted by Jean <la...@yahoo.fr>.
Hello,
I have this configuration:
Source agent 1 => channel => Kafka sink => Kafka => Kafka source (agent 2) => channel => custom sink.

It works, but the Avro headers generated in agent 1 are lost on the second Flume agent.
Is there any way to sink the full Avro event (headers and body) into Kafka instead of the body only?

Thx
Jean


Re: Kafka Sink, bad distribution of data in the partitions.

Posted by Gonzalo Herreros <gh...@gmail.com>.
Unless you are using a custom partitioner, the DefaultPartitioner assigns
them randomly, so the content of the headers shouldn't make any difference.
The only explanation I can see for what you are seeing is that somehow the
producer thinks there are only two partitions.
Are the messages going just to partitions 0 and 1, or to other numbers? Can
you try another topic and see if that happens too?

How are you checking where the messages are going?
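
One way to count messages per partition from code, a rough sketch assuming
the newer Kafka Java consumer client is available (the broker address and
topic name below are placeholders): compare the beginning and end offsets
of each partition.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class PartitionCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka1:9092"); // placeholder broker address
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        String topic = "poc-topic"; // placeholder topic name
        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor(topic)) {
                partitions.add(new TopicPartition(topic, p.partition()));
            }
            Map<TopicPartition, Long> begin = consumer.beginningOffsets(partitions);
            Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
            for (TopicPartition tp : partitions) {
                // Messages currently held in the partition = end offset - start offset.
                System.out.printf("partition %d: %d messages%n",
                        tp.partition(), end.get(tp) - begin.get(tp));
            }
        }
    }
}

The kafka.tools.GetOffsetShell class should give the same per-partition
offsets from the command line if you prefer to check that way.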

Regards,
Gonzalo

On 14 December 2015 at 22:52, Guillermo Ortiz <ko...@gmail.com> wrote:

> I'm using an architecture like this:
> Logs --> SpoolDir --> MemChannel --> AvroSink -->
> AvroSource --> MemChannel --> KafkaSink.
>
> I have a cluster with three Kafka nodes and have created a topic with six
> partitions and a replication factor of one as a POC.
>
> I have seen that 95% of the data goes to two partitions, and those two
> partitions are on the same Kafka node. I am not setting a "key" header on
> my events in Flume, so, according to the documentation, the key is
> generated randomly. The messages are logs from different sources. Is this
> behavior normal?
>