Posted to dev@samza.apache.org by nick xander <ni...@gmail.com> on 2016/04/02 00:34:40 UTC

Re: Kafka 0.9 as part of Samza 0.10?

Hi Yi,

        Thanks for the clarification, it was helpful.


I would also like to know your views on the issues below, and whether you
have employed anything to work around them.

LogCompaction Issues:

https://issues.apache.org/jira/browse/KAFKA-2163 - Offsets manager cache
should prevent stale-offset-cleanup while an offset load is in progress;
otherwise we can lose consumer offsets – *Might be an issue as it will
result in no offset to be read thereby failing the bootstrap of local key
value store*

https://issues.apache.org/jira/browse/KAFKA-2118 - Cleaner cannot clean
after shutdown during replaceSegments –
*Will prevent reading log compacted topic causing failure of local key
value store bootstrap*

https://issues.apache.org/jira/browse/KAFKA-2235 - LogCleaner offset map
overflow –
*Will probably be an issue for clients who have small messages and a
large number of keys. They need to do a lot of fine-tuning to make sure
that this doesn't happen.*
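
For the KAFKA-2235 case above, a rough back-of-the-envelope check can show
whether a changelog topic is at risk. The sketch below is in Python and rests
on assumptions that are not from this thread: that each offset map entry takes
24 bytes (a 16-byte key hash plus an 8-byte offset), and that the broker
settings log.cleaner.dedupe.buffer.size, log.cleaner.threads and
log.cleaner.io.buffer.load.factor are at what I believe are their defaults
(128 MB, 1 and 0.9). The function names are made up for illustration, and the
number of messages in a segment is approximated as segment size divided by
average message size.

```python
# Rough estimate of the Kafka log cleaner's offset map capacity.
# Assumption: 24 bytes per entry (16-byte key hash + 8-byte offset).
BYTES_PER_ENTRY = 24


def offset_map_capacity(dedupe_buffer_bytes, cleaner_threads=1, load_factor=0.9):
    """Approximate number of entries one cleaner thread's offset map can hold."""
    per_thread_bytes = dedupe_buffer_bytes // cleaner_threads
    slots = per_thread_bytes // BYTES_PER_ENTRY
    return int(slots * load_factor)


def segment_fits(segment_bytes, avg_message_bytes, capacity):
    """True if the estimated message count of one segment fits in the map."""
    estimated_messages = segment_bytes // avg_message_bytes
    return estimated_messages <= capacity


# Assumed defaults: 128 MB dedupe buffer, 1 GB log segments.
capacity = offset_map_capacity(128 * 1024 * 1024)
small_messages_ok = segment_fits(1024 * 1024 * 1024, 100, capacity)
large_messages_ok = segment_fits(1024 * 1024 * 1024, 1024, capacity)
print(capacity, small_messages_ok, large_messages_ok)
```

Under these assumptions, a 1 GB segment of roughly 100-byte messages holds
more messages than the map has room for, while the same segment with roughly
1 KB messages fits comfortably. That is why small messages with many keys are
the risky combination, and why a larger dedupe buffer or smaller segments are
the usual knobs to look at.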



Replication Issues:

https://issues.apache.org/jira/browse/KAFKA-2477 - Replicas spuriously
deleting all segments in partition –
*Will cause the data in changelog topic to be lost resulting in failure of
local key value store bootstrap. *


Though Samza can be plugged into different messaging systems, Kafka is the
major system supported today for stateful processing. If that's the case,
the bugs above can potentially keep Samza from working properly (e.g., if
the replication issue called out above hits a log-compacted topic, Samza
might not be able to restore its local key-value store). Since you are
running Samza with stateful processing, the above issues might leave a
Samza job's key-value store in an inconsistent state. Are you using Samza
with stateful processing for critical applications that cannot tolerate
data loss or inconsistencies? (With the above bugs you might not be able
to run the job for a critical application, as it might fail if it hits one
of them.) I believe that upgrading to Kafka 0.9 is critical to ensuring
that Samza also works properly. I do understand that it's not an issue
with Samza itself, but I believe that one of the primary reasons
customers/devs choose Samza is its ability to do stateful processing, and
if that fails due to the dependency on Kafka, it becomes necessary to
upgrade Kafka ASAP. Please correct me if I am wrong here.


Thanks,

Nick




------------------------------



Hi, Nick,

Let me try to answer in between the lines:

On Thu, Mar 31, 2016 at 12:49 AM, nick xander <ni...@gmail.com>
wrote:

> * Do you guys experience issues with Kafka when it is used with log
> compaction for Samza's stateful management?

The critical issue on log-compaction in Kafka that we care about is the
case where message compression and log-compaction are *both* used in the
same topic. Currently, for changelog topics, we have forcibly turned off
compression. Hence, it is not a problem for Samza's KV-stores. It is still
a problem for checkpoint topics if the Kafka producer is configured to use
message compression.

> * What is the avg number of keys per partition that you have observed in
> Kafka's log-compacted topics for stateful management, the total number of
> partitions, the replication factor and the number of Kafka brokers?

This number varies *a lot*, depending on how big your KV-store is. For
example, we have seen around 5-10GB of RocksDB KV-stores being stored in
changelog at LinkedIn. That will cause a long bootstrap time when the
container is restarted on a different host. Hence, we included the
host-affinity feature in Samza 0.10, which cut down the bootstrap time for
that particular job by 20x.

> * Will the Kafka 0.9 upgrade be included as part of Samza 0.10.1, as it
> seems critical if Samza is used for stateful management? And what is the
> timeline for Samza 0.10.1 that you are expecting?

We are planning to release Samza 0.10.1 very soon and are working on
pending code reviews and validations now. Depending on the test/validation
cycles, we hope to get a Samza 0.10.1 release candidate ready in a month or
so. The Kafka 0.9 upgrade will likely not be in Samza 0.10.1, due to the
tight release timeline this time.

> * What is the recommendation between the usage of Samza vs Kafka Connect?
> Should we use Samza for stateful management and Kafka Connect for other
> stateless streaming solutions?

Kafka Connect is mainly an ingest/output connector to/from Kafka, without
much stateful processing. Samza actually does both ingest/output and
stateful processing. If there are input data sources that Samza does not
have a SystemConsumer implementation for yet, you can definitely use
Kafka Connect for ingestion and Samza for stateful processing.

Hope the above answered your questions.

Thanks!

-Yi



On Thu, Mar 31, 2016 at 9:49 AM, nick xander <ni...@gmail.com>
wrote:

> Hi All,
>     As per this article:
> http://www.confluent.io/blog/290-reasons-to-upgrade-to-apache-kafka-0.9.0.0
> there are some well-known bugs and feature improvements around log
> compaction (stateful management in Samza) and replication. I also saw a
> Samza issue about this upgrade:
> https://issues.apache.org/jira/browse/SAMZA-855. My questions here:
>
> * Do you guys experience issues with Kafka when it is used with log
> compaction for Samza's stateful management?
> * What is the avg number of keys per partition that you have observed in
> Kafka's log-compacted topics for stateful management, the total number of
> partitions, the replication factor and the number of Kafka brokers?
> * Will the Kafka 0.9 upgrade be included as part of Samza 0.10.1, as it
> seems critical if Samza is used for stateful management? And what is the
> timeline for Samza 0.10.1 that you are expecting?
> * What is the recommendation between the usage of Samza vs Kafka Connect?
> Should we use Samza for stateful management and Kafka Connect for other
> stateless streaming solutions?
>
> Thanks,
> Nick
>

Re: Kafka 0.9 as part of Samza 0.10?

Posted by Yi Pan <ni...@gmail.com>.
Hi, Nick,

Thanks for digging out the details from the KAFKA JIRAs! I appreciate it!

As for upgrading to Kafka 0.9 to fix those critical issues, I am totally w/
you. The discussion on whether Samza 0.10.1 should include the Kafka 0.9
fixes or not has just started (by your thread :)). So, we are happy to
accommodate the request if the community has a need for that.

As for the LinkedIn deployment, we actually have already deployed an
internal version of Kafka that has most of the 0.9 fixes for log-compaction
w/ compressed messages. I will need to check w/ our Kafka team to see
whether fixes for the bugs you mentioned are also included. There is some
concern about pushing Kafka 0.9 (with client libs) into Samza 0.10.1,
because some community members are still running Kafka 0.8.2 brokers in
production and this change might incur some migration cost. Besides, Kafka
0.9 also introduces new client libraries that require code changes in
Samza's KafkaSystemConsumer/KafkaSystemProducer. Hence, our original
thought was to keep Samza 0.10.1 as a lightweight release and incorporate
Kafka 0.9 in the next major release.

However, if Kafka 0.9 brokers support Kafka 0.8.2 clients, I don't think
that should block you from using Kafka 0.9 brokers and Samza 0.10 together
to fix the server-side issues you mentioned. If there is any client-side
change in Samza that is needed, we are happy to help and, if necessary, we
can also change the scope of Samza 0.10.1 to include the Kafka 0.9 client
libraries.

Please let me know if the above works for you. If not, let me know the
specific issues for which we need to use the Kafka 0.9 client and we can
find a solution together.

Thanks a lot!

-Yi


Re: Kafka 0.9 as part of Samza 0.10?

Posted by Neha Narkhede <ne...@confluent.io>.
Nick,

The relationship between Kafka Connect
<http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines>
and any stream processing system (whether it is Samza, Kafka Streams or
anything else) is very complementary. Kafka Connect makes data available in
Kafka that stream processing systems can then process.

The purpose of Kafka Connect is to offer a framework for using real-time
streaming ingestion connectors to Kafka in the easiest way possible without
having to write extra code. It builds on top of Kafka primitives to offer
the fault-tolerance, offset management (very soon exactly-once) and
scalability that every connector needs. Since the Kafka community announced
it, the community has built 20 open-source connectors
<http://www.confluent.io/developers/connectors> :-)

Kafka Streams
<http://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple>,
on the other hand, is a lightweight library that offers stateful stream
processing capability. It is meant to make an application developer's life
easy by allowing a simple embeddable library for doing all sorts of stream
processing operations. Kafka Streams is available as part of Apache Kafka.

As far as the relationship between Kafka Connect and Samza is concerned,
this <http://www.confluent.io/blog/hello-world-kafka-connect-kafka-streams>
blog post talks about building the Hello Samza wikipedia example using
Kafka Connect and Kafka Streams. You might find it useful.

Thanks,
Neha


Re: Kafka 0.9 as part of Samza 0.10?

Posted by Kartik Paramasivam <kp...@linkedin.com.INVALID>.
The log compaction fixes in Kafka 0.9 were done by our LinkedIn Kafka
developers to fix issues faced by Samza.

So yes, Samza 0.10 can be safely used with Kafka 0.9.
At LinkedIn we currently run Kafka brokers built from Apache Kafka trunk, so
it is a more recent Kafka version. But that should be fine.

Regarding Kafka Connect: we don't use it at LinkedIn, so we don't have
any position on this. We do have several Samza jobs at LinkedIn which
have implemented system consumers to read from Kinesis and DynamoDB
streams and publish to Kafka. We have also been working on a new version
of Databus, which will be built on top of Kafka (it is similar to Kafka
Connect).


On Thu, Apr 7, 2016 at 3:51 PM, nick xander <ni...@gmail.com> wrote:

> Hi Yi,
>
>         Thanks for the support, really appreciate it to have an
> active/supportive community. It makes sense not to upgrade Samza to the
> new Kafka 0.9 client (which doesn’t rely on ZooKeeper), because that might
> break clients using Kafka 0.8.2 brokers. But as you were saying, we might
> be able to use a 0.9 broker with Samza 0.10. (Can you please confirm this?
> I tried going through the documentation and it seems possible, but “This
> means that upgraded brokers and clients may not be compatible with older
> versions” in the Kafka 0.9 documentation worries me.) It would be great if
> you guys could do some sanity testing with a Kafka 0.9 broker and see if
> there are any issues using Samza 0.10, as you are the experts in the field
> and we will not be able to identify all the ways Samza uses Kafka. E.g.,
> tools packaged under *org.apache.kafka.clients.tools.** have been moved to
> *org.apache.kafka.tools.** (I suppose this only matters for scripts), a
> bunch of Kafka configurations have been deprecated and might affect Samza,
> and there are other potential breaking changes
> <http://kafka.apache.org/documentation.html#upgrade_9_breaking>. It would
> be really helpful for this community if the Samza team could confirm that
> Kafka 0.9 can be safely used with Samza 0.10 or 0.10.1.
>
>
>
> Thanks,
>
> Nick

Re: Kafka 0.9 as part of Samza 0.10?

Posted by nick xander <ni...@gmail.com>.
Hi Yi,

        Thanks for the support; I really appreciate having an
active, supportive community. It makes sense not to upgrade Samza to the new
Kafka 0.9 client (which doesn’t rely on ZooKeeper), because that might break
clients using Kafka 0.8.2 brokers. But as you were saying, we might be able
to use a 0.9 broker with Samza 0.10. (Can you please confirm this? I went
through the documentation and it seems possible, but “This means that
upgraded brokers and clients may not be compatible with older versions” in
the Kafka 0.9 documentation worries me.) It would be great if you guys could
do some sanity testing with a Kafka 0.9 broker and see if there are any
issues using Samza 0.10, as you are the experts in the field and we will not
be able to identify all the ways Samza uses Kafka. For example: tools
packaged under *org.apache.kafka.clients.tools.** have been moved to
*org.apache.kafka.tools.** (I suppose this only affects scripts), a number of
Kafka configurations have been deprecated, which might affect Samza, and
there are other potential breaking changes
<http://kafka.apache.org/documentation.html#upgrade_9_breaking>. It would
be really helpful for this community if the Samza team could confirm that
Kafka 0.9 can be safely used with Samza 0.10 or 0.10.1.
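
For reference, the Kafka 0.9 upgrade notes linked above describe a rolling
upgrade in which the brokers move to 0.9 first, while still speaking the
0.8.2 inter-broker protocol, and clients are upgraded only afterwards. A
minimal sketch of the broker-side setting, assuming a cluster currently on
0.8.2 and following those notes (please check them against your exact
versions before relying on this):

```
# server.properties on each broker, set before swapping in the 0.9 binaries
# so that upgraded and not-yet-upgraded brokers can still talk to each other.
inter.broker.protocol.version=0.8.2.X

# Once every broker is running 0.9, change the value to 0.9.0.0 and do a
# second rolling restart.
# inter.broker.protocol.version=0.9.0.0
```

Whether Samza 0.10, with its Kafka 0.8.2 client, behaves correctly against
such a cluster is exactly what the sanity testing requested above would need
to confirm.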



Thanks,

Nick
------------------------------


Hi, Nick,

Thanks for digging out the details from the Kafka JIRAs! I appreciate it!

As for upgrading to Kafka 0.9 to fix those critical issues, I am totally with
you. The discussion on whether Samza 0.10.1 should include the Kafka 0.9 fixes
has just started (with your thread :)). So, we are happy to accommodate the
request if the community needs it.

As for the LinkedIn deployment, we have already deployed an internal version
of Kafka that has most of the 0.9 fixes for log compaction with compressed
messages. I will need to check with our Kafka team to see whether fixes for
the bugs you mentioned are also included. There is some concern about pushing
Kafka 0.9 (with client libs) into Samza 0.10.1, because some community members
are still running Kafka 0.8.2 brokers in production and this change might
incur some migration cost. Besides, Kafka 0.9 also introduces new client
library changes that require code changes in Samza's
KafkaSystemConsumer/KafkaSystemProducer. Hence, our original thought was to
keep Samza 0.10.1 as a lightweight release and incorporate Kafka 0.9 in the
next major release.

However, if Kafka 0.9 brokers support Kafka 0.8.2 clients, I don't think
anything should block you from using a Kafka 0.9 broker and Samza 0.10
together to fix the server-side issues you mentioned. If any client-side
change in Samza is needed, we are happy to help and, if necessary, we can
also change the scope of Samza 0.10.1 to include the Kafka 0.9 client
libraries.

Please let me know if the above works for you. If not, let me know the
specific issues for which we need the Kafka 0.9 client and we can find a
solution together.

Thanks a lot!

-Yi


