You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@samza.apache.org by Andreas Simanowski <ae...@gmail.com> on 2015/05/05 20:47:07 UTC

Local state in Samza - sharing data between tasks

Hello Samza community:

I am very new to Samza and currently looking at how to use Samza and its
key-value store. I have run into the following and was hoping someone could
point me in the right direction.

Say we have an input stream being consumed by more than one task (one task
per partition). Each task has a local key-value store which it will
reference when processing the messages. Because the messages coming into
the input stream are random (i.e. can hit any partition and therefore any
task), each task will need its own copy of the data (i.e. the data needs to
be duplicated across each task). From time-to-time this local data would
also need to be updated with changes. What approaches are there to share
data between the tasks to keep them up to date?

Thanks for the help!

-Andreas

Re: Local state in Samza - sharing data between tasks

Posted by Andreas Simanowski <ae...@gmail.com>.
Yi thanks for the input. I've been out sick, so please excuse the delayed
response. I am still working out the use case with my team and will report
back next week.

Thanks!

On Tue, May 5, 2015 at 4:01 PM, Yi Pan <ni...@gmail.com> wrote:

> Hi, Andreas,
>
> Are you describing a use case where the *same* copy of data is shared among
> all tasks? That will depend on a lot factors:
> 1. is your data size huge?
> 2. Can your data be partitioned to work with a single partition of input
> stream?
> 3. Do you have a means to bootstrap the data from a stream? And how often
> do you need to do the bootstrap?
>
> The first question to be answered actually seems to be the question 3. If
> you have a way to bootstrap the data from a stream, Samza can always
> bootstrap the local stores from that ingestion stream. Then, based on your
> total data size, how often you need to bootstrap, and whether your "shared"
> data can be partitioned to work with a single partition of the input
> stream, we can find out different solutions to suite your use case the
> best.
>
> On Tue, May 5, 2015 at 3:38 PM, Andreas Simanowski <ae...@gmail.com>
> wrote:
>
> > Hi Yan, thanks for the reply.
> >
> > So yes, you are correct it would not be random which partition a message
> > hits. We would use a partition key (sorry I missed that).
> >
> > The "data" I was referring to is the local KV-store data for each task.
> Is
> > there a way to synchronize or replicate the data from the KV-store across
> > the tasks, so that each tasks contains the same information in their
> > respective KV-store? Specifically I'm referring to the page on State
> > Management:
> >
> > "If you have some data that you want to share between tasks (across
> > partition boundaries), you need to go to some additional effort to
> > repartition and distribute the data. Each task will need its own copy of
> > the data, so this may use more space overall."
> >
> > Is there a simple means to ensure each task gets a copy of the data?
> >
> > Thanks!
> >
> > On Tue, May 5, 2015 at 2:44 PM, Yan Fang <ya...@gmail.com> wrote:
> >
> > > Hi Andreas,
> > >
> > > Not quite understand this part
> > >
> > > "Because the messages coming into the input stream are random (i.e. can
> > hit
> > > any partition and therefore any task), each task will need its own copy
> > of
> > > the data (i.e. the data needs to be duplicated across each task)."
> > >
> > > Messages come into the input stream based on the partition key (not
> > totally
> > > random). Why does each task need its own copy of the data? Do you mean
> > the
> > > copy of the data in other partitions?
> > >
> > > Cheers,
> > >
> > > Fang, Yan
> > > yanfang724@gmail.com
> > >
> > > On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <
> aesimano@gmail.com>
> > > wrote:
> > >
> > > > Hello Samza community:
> > > >
> > > > I am very new to Samza and currently looking at how to use Samza and
> > its
> > > > key-value store. I have run into the following and was hoping someone
> > > could
> > > > point me in the right direction.
> > > >
> > > > Say we have an input stream being consumed by more than one task (one
> > > task
> > > > per partition). Each task has a local key-value store which it will
> > > > reference when processing the messages. Because the messages coming
> > into
> > > > the input stream are random (i.e. can hit any partition and therefore
> > any
> > > > task), each task will need its own copy of the data (i.e. the data
> > needs
> > > to
> > > > be duplicated across each task). From time-to-time this local data
> > would
> > > > also need to be updated with changes. What approaches are there to
> > share
> > > > data between the tasks to keep them up to date?
> > > >
> > > > Thanks for the help!
> > > >
> > > > -Andreas
> > > >
> > >
> >
>

Re: Local state in Samza - sharing data between tasks

Posted by Yi Pan <ni...@gmail.com>.
Hi, Andreas,

Are you describing a use case where the *same* copy of data is shared among
all tasks? That will depend on a lot factors:
1. is your data size huge?
2. Can your data be partitioned to work with a single partition of input
stream?
3. Do you have a means to bootstrap the data from a stream? And how often
do you need to do the bootstrap?

The first question to be answered actually seems to be the question 3. If
you have a way to bootstrap the data from a stream, Samza can always
bootstrap the local stores from that ingestion stream. Then, based on your
total data size, how often you need to bootstrap, and whether your "shared"
data can be partitioned to work with a single partition of the input
stream, we can find out different solutions to suite your use case the best.

On Tue, May 5, 2015 at 3:38 PM, Andreas Simanowski <ae...@gmail.com>
wrote:

> Hi Yan, thanks for the reply.
>
> So yes, you are correct it would not be random which partition a message
> hits. We would use a partition key (sorry I missed that).
>
> The "data" I was referring to is the local KV-store data for each task. Is
> there a way to synchronize or replicate the data from the KV-store across
> the tasks, so that each tasks contains the same information in their
> respective KV-store? Specifically I'm referring to the page on State
> Management:
>
> "If you have some data that you want to share between tasks (across
> partition boundaries), you need to go to some additional effort to
> repartition and distribute the data. Each task will need its own copy of
> the data, so this may use more space overall."
>
> Is there a simple means to ensure each task gets a copy of the data?
>
> Thanks!
>
> On Tue, May 5, 2015 at 2:44 PM, Yan Fang <ya...@gmail.com> wrote:
>
> > Hi Andreas,
> >
> > Not quite understand this part
> >
> > "Because the messages coming into the input stream are random (i.e. can
> hit
> > any partition and therefore any task), each task will need its own copy
> of
> > the data (i.e. the data needs to be duplicated across each task)."
> >
> > Messages come into the input stream based on the partition key (not
> totally
> > random). Why does each task need its own copy of the data? Do you mean
> the
> > copy of the data in other partitions?
> >
> > Cheers,
> >
> > Fang, Yan
> > yanfang724@gmail.com
> >
> > On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <ae...@gmail.com>
> > wrote:
> >
> > > Hello Samza community:
> > >
> > > I am very new to Samza and currently looking at how to use Samza and
> its
> > > key-value store. I have run into the following and was hoping someone
> > could
> > > point me in the right direction.
> > >
> > > Say we have an input stream being consumed by more than one task (one
> > task
> > > per partition). Each task has a local key-value store which it will
> > > reference when processing the messages. Because the messages coming
> into
> > > the input stream are random (i.e. can hit any partition and therefore
> any
> > > task), each task will need its own copy of the data (i.e. the data
> needs
> > to
> > > be duplicated across each task). From time-to-time this local data
> would
> > > also need to be updated with changes. What approaches are there to
> share
> > > data between the tasks to keep them up to date?
> > >
> > > Thanks for the help!
> > >
> > > -Andreas
> > >
> >
>

Re: Local state in Samza - sharing data between tasks

Posted by Andreas Simanowski <ae...@gmail.com>.
Hi Yan, thanks for the reply.

So yes, you are correct it would not be random which partition a message
hits. We would use a partition key (sorry I missed that).

The "data" I was referring to is the local KV-store data for each task. Is
there a way to synchronize or replicate the data from the KV-store across
the tasks, so that each tasks contains the same information in their
respective KV-store? Specifically I'm referring to the page on State
Management:

"If you have some data that you want to share between tasks (across
partition boundaries), you need to go to some additional effort to
repartition and distribute the data. Each task will need its own copy of
the data, so this may use more space overall."

Is there a simple means to ensure each task gets a copy of the data?

Thanks!

On Tue, May 5, 2015 at 2:44 PM, Yan Fang <ya...@gmail.com> wrote:

> Hi Andreas,
>
> Not quite understand this part
>
> "Because the messages coming into the input stream are random (i.e. can hit
> any partition and therefore any task), each task will need its own copy of
> the data (i.e. the data needs to be duplicated across each task)."
>
> Messages come into the input stream based on the partition key (not totally
> random). Why does each task need its own copy of the data? Do you mean the
> copy of the data in other partitions?
>
> Cheers,
>
> Fang, Yan
> yanfang724@gmail.com
>
> On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <ae...@gmail.com>
> wrote:
>
> > Hello Samza community:
> >
> > I am very new to Samza and currently looking at how to use Samza and its
> > key-value store. I have run into the following and was hoping someone
> could
> > point me in the right direction.
> >
> > Say we have an input stream being consumed by more than one task (one
> task
> > per partition). Each task has a local key-value store which it will
> > reference when processing the messages. Because the messages coming into
> > the input stream are random (i.e. can hit any partition and therefore any
> > task), each task will need its own copy of the data (i.e. the data needs
> to
> > be duplicated across each task). From time-to-time this local data would
> > also need to be updated with changes. What approaches are there to share
> > data between the tasks to keep them up to date?
> >
> > Thanks for the help!
> >
> > -Andreas
> >
>

Re: Local state in Samza - sharing data between tasks

Posted by Yan Fang <ya...@gmail.com>.
Hi Andreas,

Not quite understand this part

"Because the messages coming into the input stream are random (i.e. can hit
any partition and therefore any task), each task will need its own copy of
the data (i.e. the data needs to be duplicated across each task)."

Messages come into the input stream based on the partition key (not totally
random). Why does each task need its own copy of the data? Do you mean the
copy of the data in other partitions?

Cheers,

Fang, Yan
yanfang724@gmail.com

On Tue, May 5, 2015 at 11:47 AM, Andreas Simanowski <ae...@gmail.com>
wrote:

> Hello Samza community:
>
> I am very new to Samza and currently looking at how to use Samza and its
> key-value store. I have run into the following and was hoping someone could
> point me in the right direction.
>
> Say we have an input stream being consumed by more than one task (one task
> per partition). Each task has a local key-value store which it will
> reference when processing the messages. Because the messages coming into
> the input stream are random (i.e. can hit any partition and therefore any
> task), each task will need its own copy of the data (i.e. the data needs to
> be duplicated across each task). From time-to-time this local data would
> also need to be updated with changes. What approaches are there to share
> data between the tasks to keep them up to date?
>
> Thanks for the help!
>
> -Andreas
>