You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Feroze Daud <kh...@yahoo.com.INVALID> on 2015/10/07 05:34:04 UTC

log compaction scaling with ~100m messages

hi!
We have a use case where we want to store ~100m keys in kafka. Is there any problem with this approach?
I have heard from some people using kafka, that kafka has a problem when doing log compaction with those many number of keys.
Another topic might have around 10 different K/V pairs for each key in the primary topic. The primary topic's keyspace is approx of 100m keys. We would like to store this in kafka because we are doing a lot of stream processing on these messages, and want to avoid writing another process to recompute data from snapshots.
So, in summary:
primary topic: ~100m keyssecondary topic: ~1B keys
Is it feasible to use log compaction at such a scale of data?
Thanks
feroze.

Re: log compaction scaling with ~100m messages

Posted by vipul jhawar <vi...@gmail.com>.

Just want to chime on this question as this does seem a good option to
avoid some memory hungry K,V store in case we are ok with some async
processing. There are cases where you want a combination of some near
realtime and some offline processing of the same index and as kafka topic
is much efficient in terms of memory and you can scoop out the messages
much faster on basis of the parallelism v/s introducing another component
in your store just to iterate over the keys and produce results.

Considering Joel's feedback should we really avoid this option at all or
should we shard it lot and then be ok to store approx. 500 million or a
billion messages in the topic ?

Thanks

On Thu, Oct 8, 2015 at 12:46 PM, Jan Filipiak <Ja...@trivago.com>
wrote:

> Hi,
>
> just want to pick this up again. You can always use more partitions to
> reduce the number of keys handled by a single broker and parallelize the
> compaction. So with sufficient number of machines and the ability to
> partition I don’t see you running into problems.
>
> Jan
>
>
> On 07.10.2015 05:34, Feroze Daud wrote:
>
>> hi!
>> We have a use case where we want to store ~100m keys in kafka. Is there
>> any problem with this approach?
>> I have heard from some people using kafka, that kafka has a problem when
>> doing log compaction with those many number of keys.
>> Another topic might have around 10 different K/V pairs for each key in
>> the primary topic. The primary topic's keyspace is approx of 100m keys. We
>> would like to store this in kafka because we are doing a lot of stream
>> processing on these messages, and want to avoid writing another process to
>> recompute data from snapshots.
>> So, in summary:
>> primary topic: ~100m keyssecondary topic: ~1B keys
>> Is it feasible to use log compaction at such a scale of data?
>> Thanks
>> feroze.
>>
>
>

Re: log compaction scaling with ~100m messages

Posted by Jan Filipiak <Ja...@trivago.com>.

Hi,

just want to pick this up again. You can always use more partitions to 
reduce the number of keys handled by a single broker and parallelize the 
compaction. So with sufficient number of machines and the ability to 
partition I don’t see you running into problems.

Jan

On 07.10.2015 05:34, Feroze Daud wrote:
> hi!
> We have a use case where we want to store ~100m keys in kafka. Is there any problem with this approach?
> I have heard from some people using kafka, that kafka has a problem when doing log compaction with those many number of keys.
> Another topic might have around 10 different K/V pairs for each key in the primary topic. The primary topic's keyspace is approx of 100m keys. We would like to store this in kafka because we are doing a lot of stream processing on these messages, and want to avoid writing another process to recompute data from snapshots.
> So, in summary:
> primary topic: ~100m keyssecondary topic: ~1B keys
> Is it feasible to use log compaction at such a scale of data?
> Thanks
> feroze.

Re: log compaction scaling with ~100m messages

Posted by Feroze Daud <kh...@yahoo.com.INVALID>.

Thank you for your response!
Our use case is more similar to a traditional k/v store. We are doing a new process that is going to spit huge amounts of data. We are using kafka as a broker so that downstream clients can all consume from the kafka topic. What we would like to avoid, is writing two systems - a stream processing system that runs on kafka and another one that runs on snapshots. SO for eg, when we ship, we will have a process running on kafka doing stream processing. If we change any business logic in our process we would like to reset our queue level back to zero and reprocess the whole queue.
but if I understand you, it seems that givben our key space, we cant do that?
Our update rate is as follows:
we expect ~150m unique K,V pairs to be created in the initial ship. After that, we expect about 3 updates to each key per year. Updates for all keys will not happen at the same time. so, what do you think? Do you still advise that using the topic as a K/V store with log compatction ( not time base retention ) will work?
If not, is there any other processing paradigm we can look into where we can use the same code for stream processing as well as reprocessing entire dataset? 

     On Wednesday, October 7, 2015 11:16 AM, Joel Koshy <jj...@gmail.com> wrote:

 Using log compaction is well-suited for applications that use Kafka directly and need to persist some state associated with its processing. So something like offset management for consumers is a good fit. Another good use-case is for storing schemas associated with your Kafka topics. These are both very specific to maintaining metadata around your stream processing. Although it can be used for more general K-V storage it is not always a good fit. This is especially true if your key-space is bound to grow significantly over time or has an high update rate. The other aspect is the need to do some sort of caching of your key-value pairs (since otherwise lookups would require scanning the log). So for application-level general K-V storage, you could certainly use Kafka as a persistence mechanism for recording recent updates (with traditional time-based retention), but you would probably want a more suitable K-V store separate from Kafka. I'm not sure this (i.e., traditional db storage) is your use case since you mention "a lot of stream processing on these messages" - so it sounds more like repetitive processing over the entire key space. For that it may be more reasonable. The alternative is to use snapshots and read more recent updates from the updates stream in Kafka. Samza folks may want to weigh in here as well.
That said, to answer your question: sure it is feasible to use log compaction with 1B keys, especially if you have enough brokers, partitions, and log cleaner threads but I'm not sure it is the best approach to take. We did hit various issues (bugs/feature gaps) with log compaction while using it for consumer offset management: e.g., support for compressed messages, various other bugs, but most of these have been resolved.
Hope that helps,

Joel

On Tue, Oct 6, 2015 at 8:34 PM, Feroze Daud <kh...@yahoo.com.invalid> wrote:
> hi!
> We have a use case where we want to store ~100m keys in kafka. Is there any problem with this approach?
> I have heard from some people using kafka, that kafka has a problem when doing log compaction with those many number of keys.
> Another topic might have around 10 different K/V pairs for each key in the primary topic. The primary topic's keyspace is approx of 100m keys. We would like to store this in kafka because we are doing a lot of stream processing on these messages, and want to avoid writing another process to recompute data from snapshots.
> So, in summary:
> primary topic: ~100m keyssecondary topic: ~1B keys
> Is it feasible to use log compaction at such a scale of data?
> Thanks
> feroze.

Re: log compaction scaling with ~100m messages

Posted by Joel Koshy <jj...@gmail.com>.

Using log compaction is well-suited for applications that use Kafka
directly and need to persist some state associated with its processing. So
something like offset management for consumers
<http://www.slideshare.net/jjkoshy/offset-management-in-kafka> is a good
fit. Another good use-case is for storing schemas
<https://github.com/confluentinc/schema-registry/blob/master/core/src/main/java/io/confluent/kafka/schemaregistry/storage/KafkaStore.java>
associated with your Kafka topics. These are both very specific to
maintaining metadata around your stream processing. Although it can be used
for more general K-V storage it is not *always* a good fit. This is
especially true if your key-space is bound to grow significantly over time
or has an high update rate. The other aspect is the need to do some sort of
caching of your key-value pairs (since otherwise lookups would require
scanning the log). So for application-level general K-V storage, you could
certainly use Kafka as a persistence mechanism for recording recent updates
(with traditional time-based retention), but you would probably want a more
suitable K-V store separate from Kafka. I'm not sure this (i.e.,
traditional db storage) is your use case since you mention "a lot of stream
processing on these messages" - so it sounds more like repetitive
processing over the entire key space. For that it may be more reasonable.
The alternative is to use snapshots and read more recent updates from the
updates stream in Kafka. Samza folks may want to weigh in here as well.

That said, to answer your question: sure it is feasible to use log
compaction with 1B keys, especially if you have enough brokers, partitions,
and log cleaner threads but I'm not sure it is the best approach to take.
We did hit various issues (bugs/feature gaps) with log compaction while
using it for consumer offset management: e.g., support for compressed
messages, various other bugs, but most of these have been resolved.

Hope that helps,

Joel

On Tue, Oct 6, 2015 at 8:34 PM, Feroze Daud <kh...@yahoo.com.invalid>
wrote:
> hi!
> We have a use case where we want to store ~100m keys in kafka. Is there
any problem with this approach?
> I have heard from some people using kafka, that kafka has a problem when
doing log compaction with those many number of keys.
> Another topic might have around 10 different K/V pairs for each key in
the primary topic. The primary topic's keyspace is approx of 100m keys. We
would like to store this in kafka because we are doing a lot of stream
processing on these messages, and want to avoid writing another process to
recompute data from snapshots.
> So, in summary:
> primary topic: ~100m keyssecondary topic: ~1B keys
> Is it feasible to use log compaction at such a scale of data?
> Thanks
> feroze.