Posted to dev@samza.apache.org by 李斯宁 <li...@gmail.com> on 2016/07/01 00:35:51 UTC

The best way to import data into kv store?

hi guys,
I am trying to use Samza for realtime processing.  I need to join a stream
with a userid-db.  How can I import initial data from another system into the
kv store?

From the documentation, I can see how to build the userid-db from empty by
consuming a log stream.  But in my case I already have historical userid-db
data, and I don't want to reprocess a long log history to build the userid-db
from scratch.  So I need to import the userid-db from my old batch processing
system.

Any reply is appreciated; thanks in advance.

-- 
李斯宁

Re: The best way to import data into kv store?

Posted by Yi Pan <ni...@gmail.com>.
Hi, Sining,

Yes! What you did is exactly what I meant by "batch-to-stream job"! Enjoy
Samza!

-Yi

On Mon, Jul 11, 2016 at 8:50 AM, 李斯宁 <li...@gmail.com> wrote:

> hi, Yi
> Thanks for your response.
> My old userid-db is stored in an HDFS folder, and I have found a way to
> import my userid data:
>
> 1) Create a MapReduce job that writes the complete userid data to a Kafka
> topic; let's call it "import_uid".
> 2) In the Samza JoinTask's configuration, set "import_uid" as a bootstrap
> input stream.
> 3) In the task, when processing the "import_uid" topic, write to the kv
> store.
> 4) In the task, when processing my realtime stream, read from the kv store
> and do the join.
>
> The key point is importing the data via a bootstrap stream.
> Is this what you meant by the "batch-to-stream" approach?
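[The four quoted steps can be sketched without any Samza dependencies as below. `BootstrapJoinSketch`, its method names, and the `HashMap` standing in for the RocksDB-backed KeyValueStore are all illustrative, not Samza API; in Samza's configuration of that era, step 2 corresponds to marking the topic as a bootstrap stream, e.g. `systems.kafka.streams.import_uid.samza.bootstrap=true`.]

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, Samza-free simulation of the bootstrap-then-join pattern.
// A bootstrap stream is fully consumed before normal input is processed,
// so onImport() calls happen before onRealtime() calls in this sketch.
public class BootstrapJoinSketch {
    // Stand-in for Samza's RocksDB-backed KeyValueStore.
    private final Map<String, String> userStore = new HashMap<>();

    // Step 3: each message from the bootstrap topic populates the store.
    public void onImport(String userId, String userRecord) {
        userStore.put(userId, userRecord);
    }

    // Step 4: each realtime event is enriched by a store lookup,
    // or dropped (null) when no user record exists.
    public String onRealtime(String userId, String event) {
        String user = userStore.get(userId);
        return user == null ? null : event + "|" + user;
    }
}
```

[In a real Samza task, steps 3 and 4 would both live in `process()`, dispatching on the incoming message's stream name.]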
>
>
> On Thu, Jul 7, 2016 at 1:05 AM, Yi Pan <ni...@gmail.com> wrote:
>
> > Hi, Sining,
> >
> > There are a few questions to ask so that we understand your application
> > use case better:
> >
> > 1) In what format is your old userid-db data?
> > 2) Is the old userid-db data partitioned by the same key and into the same
> > number of partitions as you expect to consume in your Samza job?
> >
> > Generally speaking, we would have to employ a batch-to-stream push job
> > because:
> > 1) Your old userid-db may not already be a RocksDB database file.
> > 2) Your old userid-db may not be partitioned the same way as you expect to
> > consume in your Samza job.
> > 3) The location of a specific partition of your userid-db in a Samza job
> > is dynamically allocated as YARN schedules the containers in the cluster.
> > Hence, where to copy the offline data is not known a priori.
> >
> > -Yi
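[Point 2 above is the co-partitioning requirement: the join only works if the producer of the import topic and the Samza job agree on both the key and the partition count. The sketch below uses a simple modulo partitioner as a stand-in for Kafka's actual default partitioner (which hashes the serialized key bytes with murmur2); the class and method names are made up for illustration.]

```java
// Hypothetical partitioner sketch: deterministic key -> partition mapping.
// With the same key and the same partition count, records always land in the
// same partition; change the partition count and the mapping can shift, which
// is why the import topic must match the job's expected partitioning.
public class PartitionSketch {
    public static int partitionFor(String key, int numPartitions) {
        // Mask the sign bit so the result is always in [0, numPartitions).
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```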
> >
>
> --
> 李斯宁
>

