Posted to users@kafka.apache.org by Andy Chambers <ac...@gmail.com> on 2016/09/03 01:43:45 UTC

Partitioning at the edges

Hey Folks,

We are having quite a bit of trouble modelling the flow of data through a
very Kafka-centric system.

As I understand it, every stream you might want to join with another must
be partitioned the same way. But often streams at the edges of a system
*cannot* be partitioned the same way because they don't have the partition
key yet (often the work for this process is to find the key in some lookup
table based on some other key we don't control).

We have come up with a few solutions but everything seems to add complexity
and backs our designs into a corner.

What is frustrating is that most of the data is not really that big but we
have a handful of topics we expect to require a lot of throughput.

Is this just unavoidable complexity associated with scale, or am I thinking
about this in the wrong way? We're going all in on the "turning the
database inside out" architecture, but we end up spending more time thinking
about how stuff gets broken up into tasks and distributed than we do about
our business.

Do these problems seem familiar to anyone else?  Did you find any patterns
that helped keep the complexity down?

Cheers

Re: Partitioning at the edges

Posted by Andy Chambers <ac...@gmail.com>.
Looks like re-partitioning is probably the way to go. I've seen reference
to this pattern a couple of times but wanted to make sure I wasn't missing
something obvious. It looks like Kafka Streams makes this kind of thing a
bit easier than Samza.

Thanks for sharing your wisdom folks :-)



On Wed, Sep 7, 2016 at 1:43 PM, David Garcia <da...@spiceworks.com> wrote:

> Obviously for the keys you don’t have, you would have to look them
> up…sorry, I kinda missed that part.  That is indeed a pain.  The job that
> looks those keys up would probably have to batch queries to the external
> system.  Maybe you could use kafka-connect-jdbc to stream in updates to
> that system?
>
> -David
>
>
> On 9/7/16, 3:41 PM, "David Garcia" <da...@spiceworks.com> wrote:
>
>     The “simplest” way to solve this is to “repartition” your data (i.e.
> the streams you wish to join) with the partition key you wish to join on.
> This obviously introduces redundancy, but it will solve your problem.  For
> example.. suppose you want to join topic T1 and topic T2…but they aren’t
> partitioned on the key you need.  You could write two “simple” repartition
> jobs for each topic (you can actually do this with one job):
>
>     T1 -> Job_T1 -> T1’
>     T2 -> Job_T2 -> T2’
>
>     T1’ and T2’ would be partitioned on your join key and would have the
> same number of partitions so that you have the guarantees you need to do
> the join.  (i.e. join T1’ and T2’).
>
>     -David
>
>
>     On 9/2/16, 8:43 PM, "Andy Chambers" <ac...@gmail.com> wrote:
>
>         Hey Folks,
>
>         We are having quite a bit of trouble modelling the flow of data
>         through a very Kafka-centric system.
>
>         As I understand it, every stream you might want to join with
>         another must be partitioned the same way. But often streams at the
>         edges of a system *cannot* be partitioned the same way because they
>         don't have the partition key yet (often the work for this process
>         is to find the key in some lookup table based on some other key we
>         don't control).
>
>         We have come up with a few solutions but everything seems to add
>         complexity and backs our designs into a corner.
>
>         What is frustrating is that most of the data is not really that big
>         but we have a handful of topics we expect to require a lot of
>         throughput.
>
>         Is this just unavoidable complexity associated with scale, or am I
>         thinking about this in the wrong way? We're going all in on the
>         "turning the database inside out" architecture, but we end up
>         spending more time thinking about how stuff gets broken up into
>         tasks and distributed than we do about our business.
>
>         Do these problems seem familiar to anyone else?  Did you find any
>         patterns that helped keep the complexity down?
>
>         Cheers
>
>
>
>
>
>
>

Re: Partitioning at the edges

Posted by David Garcia <da...@spiceworks.com>.
Obviously for the keys you don’t have, you would have to look them up…sorry, I kinda missed that part.  That is indeed a pain.  The job that looks those keys up would probably have to batch queries to the external system.  Maybe you could use kafka-connect-jdbc to stream in updates to that system?
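
Something along these lines might be a starting point, assuming the Confluent kafka-connect-jdbc source connector (the connection URL, table, column and topic names below are only placeholders):

    # Hypothetical JDBC source config: stream rows of the external mapping
    # table into a Kafka topic so the lookup data is available as a stream.
    name=customer-id-mapping-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://lookup-db:5432/crm
    table.whitelist=customer_id_mapping
    mode=timestamp+incrementing
    timestamp.column.name=updated_at
    incrementing.column.name=id
    topic.prefix=lookup-
    tasks.max=1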

-David


On 9/7/16, 3:41 PM, "David Garcia" <da...@spiceworks.com> wrote:

    The “simplest” way to solve this is to “repartition” your data (i.e. the streams you wish to join) with the partition key you wish to join on.  This obviously introduces redundancy, but it will solve your problem.  For example.. suppose you want to join topic T1 and topic T2…but they aren’t partitioned on the key you need.  You could write two “simple” repartition jobs for each topic (you can actually do this with one job):
    
    T1 -> Job_T1 -> T1’
    T2 -> Job_T2 -> T2’
    
    T1’ and T2’ would be partitioned on your join key and would have the same number of partitions so that you have the guarantees you need to do the join.  (i.e. join T1’ and T2’).
    
    -David
    
    
    On 9/2/16, 8:43 PM, "Andy Chambers" <ac...@gmail.com> wrote:
    
        Hey Folks,
        
        We are having quite a bit of trouble modelling the flow of data through a
        very Kafka-centric system.
        
        As I understand it, every stream you might want to join with another must
        be partitioned the same way. But often streams at the edges of a system
        *cannot* be partitioned the same way because they don't have the partition
        key yet (often the work for this process is to find the key in some lookup
        table based on some other key we don't control).
        
        We have come up with a few solutions but everything seems to add complexity
        and backs our designs into a corner.
        
        What is frustrating is that most of the data is not really that big but we
        have a handful of topics we expect to require a lot of throughput.
        
        Is this just unavoidable complexity associated with scale, or am I thinking
        about this in the wrong way? We're going all in on the "turning the
        database inside out" architecture, but we end up spending more time thinking
        about how stuff gets broken up into tasks and distributed than we do about
        our business.
        
        Do these problems seem familiar to anyone else?  Did you find any patterns
        that helped keep the complexity down?
        
        Cheers
        
    
    
    



Re: Partitioning at the edges

Posted by David Garcia <da...@spiceworks.com>.
The “simplest” way to solve this is to “repartition” your data (i.e. the streams you wish to join) with the partition key you wish to join on.  This obviously introduces redundancy, but it will solve your problem.  For example, suppose you want to join topic T1 and topic T2…but they aren’t partitioned on the key you need.  You could write two “simple” repartition jobs, one for each topic (you can actually do this with one job):

T1 -> Job_T1 -> T1’
T2 -> Job_T2 -> T2’

T1’ and T2’ would be partitioned on your join key and would have the same number of partitions so that you have the guarantees you need to do the join.  (i.e. join T1’ and T2’).
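
As a rough sketch (not a drop-in implementation), each of those jobs can be a plain consumer/producer loop that re-keys every record on the join key, so the default partitioner sends matching keys to the same partition of T1'. The topic names and the key-extraction helper here are hypothetical, and offset/error handling is left out:

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class RepartitionT1 {

        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "repartition-T1");
            consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
            consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
                consumer.subscribe(Collections.singletonList("T1"));
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> record : records) {
                        // Re-key on the join key; the default partitioner hashes the
                        // key, so matching keys land in the same partition of T1'.
                        String joinKey = extractJoinKey(record.value());
                        producer.send(new ProducerRecord<>("T1-repartitioned", joinKey, record.value()));
                    }
                }
            }
        }

        // Placeholder: pull the join key out of the record value in whatever way
        // your message format allows.
        private static String extractJoinKey(String value) {
            return value.split(",")[0];
        }
    }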

-David


On 9/2/16, 8:43 PM, "Andy Chambers" <ac...@gmail.com> wrote:

    Hey Folks,
    
    We are having quite a bit of trouble modelling the flow of data through a
    very Kafka-centric system.
    
    As I understand it, every stream you might want to join with another must
    be partitioned the same way. But often streams at the edges of a system
    *cannot* be partitioned the same way because they don't have the partition
    key yet (often the work for this process is to find the key in some lookup
    table based on some other key we don't control).
    
    We have come up with a few solutions but everything seems to add complexity
    and backs our designs into a corner.
    
    What is frustrating is that most of the data is not really that big but we
    have a handful of topics we expect to require a lot of throughput.
    
    Is this just unavoidable complexity associated with scale, or am I thinking
    about this in the wrong way? We're going all in on the "turning the
    database inside out" architecture, but we end up spending more time thinking
    about how stuff gets broken up into tasks and distributed than we do about
    our business.
    
    Do these problems seem familiar to anyone else?  Did you find any patterns
    that helped keep the complexity down?
    
    Cheers
    



Re: Partitioning at the edges

Posted by Eno Thereska <en...@gmail.com>.
Hi Andy,

One option would be to take the original transaction data and enrich it
with the customer ID and then send the result to a new topic, partitioned
appropriately. That would then allow you to do a join between that topic
and the ledger data (otherwise I don't see how you can join given that the
transactions do not have the customer ID). It's worth mentioning that in
Kafka trunk the repartitioning happens automatically (while in 0.10.0.0 the
user needs to manually repartition topics).
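
A very rough sketch of that enrichment step is below. It uses a newer Streams DSL than the 0.10.0.0 one, string values to keep it short, and made-up topic names, so treat it as an illustration of the shape rather than working code for that release:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;
    import org.apache.kafka.streams.kstream.KTable;

    public class TransactionEnricher {

        public static void main(String[] args) {
            StreamsBuilder builder = new StreamsBuilder();

            // Bank transactions, keyed by the external identifier the bank sends.
            KStream<String, String> transactions = builder.stream("bank-transactions");

            // Mapping of external identifier -> customer-id, also kept in Kafka.
            KTable<String, String> idMapping = builder.table("customer-id-mapping");

            transactions
                // Both sides are keyed by the external identifier (and must be
                // co-partitioned), so this join attaches the customer-id.
                .join(idMapping, (txn, customerId) -> customerId + "|" + txn)
                // Re-key on customer-id so the enriched topic is partitioned the
                // same way as the ledger topics.
                .selectKey((externalId, enriched) -> enriched.substring(0, enriched.indexOf('|')))
                .to("transactions-by-customer");

            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "transaction-enricher");
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            new KafkaStreams(builder.build(), props).start();
        }
    }

The enriched topic, now keyed by customer-id, can then be joined against the ledger topics without any further repartitioning.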

Eno

Begin forwarded message:

From: Andy Chambers <ac...@gmail.com>
Subject: Re: Partitioning at the edges
Date: 3 September 2016 at 17:57:53 BST
To: users@kafka.apache.org
Reply-To: users@kafka.apache.org

Hi Eno,

I'll try. We have a feed of transaction data from the bank, each record of
which we must try to associate with a customer in our system. Unfortunately
the transaction data doesn't include the customer-id itself, but rather a
variety of other identifiers that we can use to look up the customer-id in
a mapping table (which is also in Kafka).

Once we have the customer-id, we produce some records to a "ledger-request"
topic (which is partitioned by customer-id). The exact sequence of records
produced depends on the type of transaction, and whether we were able to
find the customer-id but if all goes well, the ledger will produce events
on a "ledger-result" topic for each request record.

This is where we have a problem. The ledger is the bit that has to be super
performant so these topics must be highly partitioned on customer-id. But
then we'd like to join the results with the original bank-transaction
(which remember doesn't even have a customer-id) so we can mark it as being
handled or not when the result comes back.

Current plan is to just use a single partition for most topics but then
"wrap" the ledger system with a process that re-partitions it's input as
necessary for scale.

Cheers,
Andy

On Sat, Sep 3, 2016 at 6:13 AM, Eno Thereska <en...@gmail.com> wrote:

Hi Andy,

Could you share a bit more info or pseudocode so that we can understand
the scenario a bit better? Especially around the streams at the edges. How
are they created and what is the join meant to do?

Thanks
Eno

On 3 Sep 2016, at 02:43, Andy Chambers <ac...@gmail.com> wrote:

Hey Folks,

We are having quite a bit of trouble modelling the flow of data through a
very Kafka-centric system.

As I understand it, every stream you might want to join with another must
be partitioned the same way. But often streams at the edges of a system
*cannot* be partitioned the same way because they don't have the partition
key yet (often the work for this process is to find the key in some lookup
table based on some other key we don't control).

We have come up with a few solutions but everything seems to add complexity
and backs our designs into a corner.

What is frustrating is that most of the data is not really that big but we
have a handful of topics we expect to require a lot of throughput.

Is this just unavoidable complexity associated with scale, or am I thinking
about this in the wrong way? We're going all in on the "turning the
database inside out" architecture, but we end up spending more time thinking
about how stuff gets broken up into tasks and distributed than we do about
our business.

Do these problems seem familiar to anyone else?  Did you find any patterns
that helped keep the complexity down?

Cheers

Re: Partitioning at the edges

Posted by Andy Chambers <ac...@gmail.com>.
Hi Eno,

I'll try. We have a feed of transaction data from the bank, each record of
which we must try to associate with a customer in our system. Unfortunately
the transaction data doesn't include the customer-id itself, but rather a
variety of other identifiers that we can use to look up the customer-id in
a mapping table (which is also in Kafka).

Once we have the customer-id, we produce some records to a "ledger-request"
topic (which is partitioned by customer-id). The exact sequence of records
produced depends on the type of transaction, and whether we were able to
find the customer-id but if all goes well, the ledger will produce events
on a "ledger-result" topic for each request record.

This is where we have a problem. The ledger is the bit that has to be super
performant so these topics must be highly partitioned on customer-id. But
then we'd like to join the results with the original bank-transaction
(which remember doesn't even have a customer-id) so we can mark it as being
handled or not when the result comes back.

Current plan is to just use a single partition for most topics but then
"wrap" the ledger system with a process that re-partitions it's input as
necessary for scale.

Cheers,
Andy

On Sat, Sep 3, 2016 at 6:13 AM, Eno Thereska <en...@gmail.com> wrote:

> Hi Andy,
>
> Could you share a bit more info or pseudocode so that we can understand
> the scenario a bit better? Especially around the streams at the edges. How
> are they created and what is the join meant to do?
>
> Thanks
> Eno
>
> > On 3 Sep 2016, at 02:43, Andy Chambers <ac...@gmail.com> wrote:
> >
> > Hey Folks,
> >
> > We are having quite a bit of trouble modelling the flow of data through
> > a very Kafka-centric system.
> >
> > As I understand it, every stream you might want to join with another must
> > be partitioned the same way. But often streams at the edges of a system
> > *cannot* be partitioned the same way because they don't have the partition
> > key yet (often the work for this process is to find the key in some lookup
> > table based on some other key we don't control).
> >
> > We have come up with a few solutions but everything seems to add
> > complexity and backs our designs into a corner.
> >
> > What is frustrating is that most of the data is not really that big but
> > we have a handful of topics we expect to require a lot of throughput.
> >
> > Is this just unavoidable complexity associated with scale, or am I
> > thinking about this in the wrong way? We're going all in on the "turning
> > the database inside out" architecture, but we end up spending more time
> > thinking about how stuff gets broken up into tasks and distributed than
> > we do about our business.
> >
> > Do these problems seem familiar to anyone else?  Did you find any
> > patterns that helped keep the complexity down?
> >
> > Cheers
>
>

Re: Partitioning at the edges

Posted by Eno Thereska <en...@gmail.com>.
Hi Andy,

Could you share a bit more info or pseudocode so that we can understand the scenario a bit better? Especially around the streams at the edges. How are they created and what is the join meant to do?

Thanks
Eno

> On 3 Sep 2016, at 02:43, Andy Chambers <ac...@gmail.com> wrote:
> 
> Hey Folks,
> 
> We are having quite a bit of trouble modelling the flow of data through a
> very Kafka-centric system.
> 
> As I understand it, every stream you might want to join with another must
> be partitioned the same way. But often streams at the edges of a system
> *cannot* be partitioned the same way because they don't have the partition
> key yet (often the work for this process is to find the key in some lookup
> table based on some other key we don't control).
> 
> We have come up with a few solutions but everything seems to add complexity
> and backs our designs into a corner.
> 
> What is frustrating is that most of the data is not really that big but we
> have a handful of topics we expect to require a lot of throughput.
> 
> Is this just unavoidable complexity associated with scale, or am I thinking
> about this in the wrong way? We're going all in on the "turning the
> database inside out" architecture, but we end up spending more time thinking
> about how stuff gets broken up into tasks and distributed than we do about
> our business.
> 
> Do these problems seem familiar to anyone else?  Did you find any patterns
> that helped keep the complexity down?
> 
> Cheers