Posted to dev@samza.apache.org by Shouichi Kamiya <sh...@gmail.com> on 2015/09/29 19:18:07 UTC

Where to store joined results in table-table join example

Hello everyone,

I am new to stream processing and need a clarification on the
table-table join example in the state management document.

http://samza.apache.org/learn/documentation/0.9/container/state-management.html

> Implementation: The job subscribes to the change streams for the user profiles database and the user settings database, both partitioned by user_id. The job keeps a key-value store keyed by user_id, which contains the latest profile record and the latest settings record for each user_id. When a new event comes in from either stream, the job looks up the current value in its store, updates the appropriate fields (depending on whether it was a profile update or a settings update), and writes back the new joined record to the store. The changelog of the store doubles as the output stream of the task.

I understand that the job stores the latest profile and settings
records in the local key-value store (for performance). I don't
understand where to store joined results. Should I store them in the
local kv store or external database? How can other tasks or services
fetch the joined results if they are stored in the local kv store?

Sincerely,
Shouichi

-- 
Shouichi Kamiya

Re: Where to store joined results in table-table join example

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.

Hi Shouichi,

We don't allow external services to access the local KV store, as it is
meant for local computation only. Hence, if you store your join results in
the local KV store, other services cannot access them.

If you want external services to access your join results, you should make
them available outside of your Samza job. You can write to an external DB
from the Samza job itself, although a remote write call to a DB might
reduce the throughput of the job. One optimization is to batch the join
results before updating the external DB.
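
As a rough illustration of the batching idea, here is a minimal sketch in
plain Python. It does not use the Samza API: BatchingWriter, write_batch,
and max_batch_size are hypothetical names, and write_batch stands in for
whatever bulk-write call your external DB client provides.

```python
from typing import Callable, Dict, List, Tuple

JoinedRecord = Dict[str, object]


class BatchingWriter:
    """Buffers joined records and flushes them to an external store in batches."""

    def __init__(self,
                 write_batch: Callable[[List[Tuple[str, JoinedRecord]]], None],
                 max_batch_size: int = 100) -> None:
        self._write_batch = write_batch        # hypothetical bulk write to the external DB
        self._max_batch_size = max_batch_size  # flush once this many distinct keys are pending
        self._pending: Dict[str, JoinedRecord] = {}

    def add(self, user_id: str, record: JoinedRecord) -> None:
        # Keep only the latest record per key; an older pending version is superseded.
        self._pending[user_id] = record
        if len(self._pending) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        if not self._pending:
            return
        # One remote call for the whole batch instead of one call per update.
        self._write_batch(list(self._pending.items()))
        # Cleared only after the write returns, so a failed write keeps the batch.
        self._pending.clear()
```

In a real job, flush() would presumably also need to be called
periodically and before progress is checkpointed, so that buffered results
are not lost on a restart.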

If the external results store does not allow asynchronous or batched
writes, I suggest writing the join results to an output stream and writing
another Samza job that writes them to the external DB.
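
To make the two-job layout concrete, here is a minimal sketch in plain
Python, again without the Samza API. JoinTask, process, and sink_to_db are
hypothetical names; the dict stands in for the task-local KV store and the
list stands in for the output stream.

```python
from typing import Dict, List, Tuple

Record = Dict[str, object]


class JoinTask:
    """Table-table join keyed by user_id; each new joined record is emitted."""

    def __init__(self, output_stream: List[Tuple[str, Record]]) -> None:
        self.store: Dict[str, Record] = {}  # stand-in for the task-local KV store
        self.output_stream = output_stream  # stand-in for an output stream

    def process(self, source: str, user_id: str, change: Record) -> None:
        if source not in ("profile", "settings"):
            raise ValueError(f"unknown source stream: {source}")
        # Look up the current joined value, update the side that changed,
        # and write the new joined record back to the local store.
        joined = dict(self.store.get(user_id, {"profile": None, "settings": None}))
        joined[source] = change
        self.store[user_id] = joined
        # Emit the joined record so that consumers outside this job can see it.
        self.output_stream.append((user_id, joined))


def sink_to_db(output_stream: List[Tuple[str, Record]], db: Dict[str, Record]) -> None:
    """Stand-in for the second job: apply the output stream to an external DB."""
    for user_id, joined in output_stream:
        db[user_id] = joined  # later records for the same key overwrite earlier ones
```

With this split, the join job never waits on the external DB, and the
second job can write at whatever pace the DB allows.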

Cheers!
Navina

-- 
Navina R.