Posted to dev@samza.apache.org by Michael Sklyar <mi...@gmail.com> on 2015/12/28 18:19:51 UTC

Is it possible to ensure that the same key in different kafka topics will always reach the same Samza Task instance?

Hi,


I have a question regarding Kafka partitions while working with RocksDB as
an enrichment cache.


We have a stream of URLs; a very simplified version of the pipeline would be:

(1) URL (some 24 partitions) -> (2) read enrichments task (from RocksDB)
-> (3) make decision


One of the enrichments is a counter that must be accurate. To achieve this,
we partition the input Kafka topic (1) by key, so the same URL always
arrives at the same task instance and the counter is correct.

For other enrichments (for example web title, Google page rank…) we have
other tasks that write to additional Kafka topics, which are also consumed
by (2). Is it possible to make sure that the same key in different Kafka
topics will reach the same Samza task instance?

The other option, of course, would be to hold all the enrichments in all
RocksDB instances.



What do you think? What is the best practice?



Thanks,

Michael Sklyar

Re: Is it possible to ensure that the same key in different kafka topics will always reach the same Samza Task instance?

Posted by Michael Sklyar <mi...@gmail.com>.
Thanks!
That's exactly the desired behavior.

On Tue, Dec 29, 2015 at 6:10 AM, Jagadish Venkatraman <
jagadish1989@gmail.com> wrote:

> Hi Michael,
>
> The same key in different topics will be routed to the same task instance
> by default, assuming the key lands in the same partition ID in both topics
> (i.e., the topics have the same number of partitions and are keyed by the
> same key field).
>
> The default behavior is to group topic-partitions by partition ID. Please
> refer to the property job.systemstreampartition.grouper.factory in
> http://samza.apache.org/learn/documentation/0.9/jobs/configuration-table.html.
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

Re: Is it possible to ensure that the same key in different kafka topics will always reach the same Samza Task instance?

Posted by Jagadish Venkatraman <ja...@gmail.com>.
Hi Michael,

The same key in different topics will be routed to the same task instance
by default, assuming the key lands in the same partition ID in both topics
(i.e., the topics have the same number of partitions and are keyed by the
same key field).

The default behavior is to group topic-partitions by partition ID. Please
refer to the property job.systemstreampartition.grouper.factory in
http://samza.apache.org/learn/documentation/0.9/jobs/configuration-table.html.
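
To illustrate the routing described above, here is a minimal Python sketch. It
is not Samza or Kafka code: partition_for is a hypothetical stand-in
partitioner (CRC32 modulo the partition count, not Kafka's actual hash),
task_for only mimics the idea of grouping by partition ID, and the topic
names and URL are made up. It shows that when two topics share a partition
count and a partitioner, the same key resolves to the same task.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Hypothetical partitioner: stable hash of the key modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def task_for(partition_id: int) -> str:
    """Group by partition ID: one task per ID, shared by every input topic."""
    return f"Partition {partition_id}"


def route(key: str, num_partitions: int) -> str:
    """Task that would receive a message with this key from a given topic."""
    return task_for(partition_for(key, num_partitions))


# Assumed setup: both topics have 24 partitions and are keyed by URL.
topics = {"urls": 24, "web-titles": 24}
url = "http://example.com/some/page"

tasks = {topic: route(url, count) for topic, count in topics.items()}
same_task = len(set(tasks.values())) == 1

for topic, task in tasks.items():
    print(f"{topic}: {task}")
print("same task for both topics:", same_task)
```

If the topics had different partition counts, or were written with different
partitioners, the two lookups could resolve to different partition IDs, and
the guarantee would no longer hold.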




-- 
Jagadish V,
Graduate Student,
Department of Computer Science,
Stanford University