Posted to dev@samza.apache.org by Michael Sklyar <mi...@gmail.com> on 2015/09/20 10:28:16 UTC

Asynchronous approach and samza

Hi,

What would be the best approach for doing "blocking" operations in Samza?

For example, we have a Kafka stream of URLs for which we need to gather
external data via HTTP (such as the Alexa rank, the page title and
headers). Other scenarios include database access and decision making via
a rule engine.

Samza processes messages in a single thread, and HTTP requests might take
hundreds of milliseconds. With the single-threaded design the throughput
would be very limited, which can be solved with an asynchronous approach.
However, the Samza documentation explicitly states
"*You are strongly discouraged from using threads in your job’s code*".

It seems that Samza's design suits "data transformation" scenarios very
well; what is not clear is how well it can support external services.

Thanks,
Michael Sklyar
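
As a rough illustration of the throughput concern above, here is a
back-of-the-envelope sketch in Java. The 200 ms latency and the figure of 50
concurrent calls are assumptions chosen for the example, not numbers from the
thread, and the class name is hypothetical.

```java
/** Back-of-the-envelope throughput estimate for blocking calls (illustrative only). */
final class ThroughputEstimate {
    /**
     * Upper bound on messages per second when each message blocks for
     * latencyMillis and at most `concurrency` calls are in flight at once.
     */
    static double messagesPerSecond(double latencyMillis, int concurrency) {
        if (latencyMillis <= 0 || concurrency < 1) {
            throw new IllegalArgumentException("latency must be > 0 and concurrency >= 1");
        }
        return concurrency * (1000.0 / latencyMillis);
    }

    public static void main(String[] args) {
        // 200 ms is an assumed stand-in for "hundreds of milliseconds".
        System.out.println(messagesPerSecond(200, 1));   // one blocking call at a time
        System.out.println(messagesPerSecond(200, 50));  // 50 calls in flight (assumed)
    }
}
```

Under these assumptions a single blocking thread tops out at about 5 messages
per second, which is why some form of concurrency is needed for this workload.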

Re: Asynchronous approach and samza

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
@Ken: I was going to suggest batch processing with Samza, which is pretty
much what you just said. Thanks for your valuable input. :)

@Michael: I think the pattern I suggested will not work out for your data
scale. Following a batch processing model with Samza can fulfill the
requirements of your use-case.

Cheers!
Navina

On Mon, Sep 21, 2015 at 10:28 PM, Jordan Shaw <jo...@pubnub.com> wrote:

> Michael,
> Why not just have a pool of workers outside of Samza that push the
> raw crawler input, or a subset of it, into a Kafka topic, and then have
> Samza do the compute/stream work? Basically, Samza is not the right tool for
> what you're suggesting, but it could be used for downstream work, in my opinion.
> -Jordan



-- 
Navina R.

Re: Asynchronous approach and samza

Posted by Jordan Shaw <jo...@pubnub.com>.
Michael,
Why not just have a pool of workers outside of Samza that push the
raw crawler input, or a subset of it, into a Kafka topic, and then have
Samza do the compute/stream work? Basically, Samza is not the right tool for
what you're suggesting, but it could be used for downstream work, in my opinion.
-Jordan

On Mon, Sep 21, 2015 at 9:08 AM, Ken Krugler <kk...@transpac.com>
wrote:

> Hi Michael (& Navina),
>
> I don't think you need to create a separate background process, at least
> for the case of web crawling.
>
> The challenge is to efficiently use one Samza process to simultaneously
> fetch many URLs.
>
> Which does increase the complexity of that process's code, as you wind up
> having to manage either a multi-threaded or async fetch state.
>
> But that's the same as for Hadoop-based crawlers, where you have a limited
> number of parallel reduce tasks that are doing the fetching - see Nutch and
> Bixo for examples, e.g. FetchBuffer.
>
> And it's the same for storm-crawler, another project I've been involved
> with in the past.
>
> -- Ken


-- 
Jordan Shaw
Full Stack Software Engineer
PubNub Inc
1045 17th St
San Francisco, CA 94107

RE: Asynchronous approach and samza

Posted by Ken Krugler <kk...@transpac.com>.
Hi Michael (& Navina),

I don't think you need to create a separate background process, at least for the case of web crawling.

The challenge is to efficiently use one Samza process to simultaneously fetch many URLs.

Which does increase the complexity of that process's code, as you wind up having to manage either a multi-threaded or async fetch state.

But that's the same as for Hadoop-based crawlers, where you have a limited number of parallel reduce tasks that are doing the fetching - see Nutch and Bixo for examples, e.g. FetchBuffer.

And it's the same for storm-crawler, another project I've been involved with in the past.

-- Ken
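
One way to picture "one process fetching many URLs at once" is a batch
fetch: collect a batch of URLs, fetch them concurrently, and wait for the
whole batch before moving on. The sketch below is plain Java and does not use
any Samza API. BatchFetcher and the fetch function passed to it are
hypothetical names introduced for illustration; a real fetch function would
perform the HTTP request.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

/** Hypothetical helper: fetches a batch of URLs concurrently on a fixed pool. */
final class BatchFetcher {
    private final ExecutorService pool;
    private final Function<String, String> fetch; // assumed: URL -> fetched data

    BatchFetcher(int threads, Function<String, String> fetch) {
        this.pool = Executors.newFixedThreadPool(threads);
        this.fetch = fetch;
    }

    /** Starts all fetches in the batch, then blocks until every one has finished. */
    Map<String, String> fetchAll(List<String> urls) {
        List<CompletableFuture<String>> futures = new ArrayList<>();
        for (String url : urls) {
            futures.add(CompletableFuture
                .supplyAsync(() -> fetch.apply(url), pool)
                // Turn a failed fetch into a marker value instead of failing the batch.
                .exceptionally(e -> "ERROR: " + e));
        }
        Map<String, String> results = new LinkedHashMap<>();
        for (int i = 0; i < urls.size(); i++) {
            results.put(urls.get(i), futures.get(i).join());
        }
        return results;
    }

    void shutdown() {
        pool.shutdown();
    }
}
```

The calling thread still sees a simple synchronous call per batch, so the
extra state to manage is limited to the pool and the batch itself.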

> From: Michael Sklyar
> Sent: September 21, 2015 5:19:52am PDT
> To: dev@samza.apache.org
> Subject: Re: Asynchronous approach and samza
> 
> Thanks Navina,
> it is much more clear now.
> 
> Unfortunately, in our case, we cannot bootstrap the data in advance (we
> can't pre-fetch the titles and headers of all existing URLs).
> It sounds to me that, if we want to use Samza, we will need a background
> process that is synchronized with the main event loop of the task
> (plus handling of back-pressure, so that no more than X requests can be
> made simultaneously).
> 
> 
> Regards,
> Michael





--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr






Re: Asynchronous approach and samza

Posted by Michael Sklyar <mi...@gmail.com>.
Thanks Navina,
it is much more clear now.

Unfortunately, in our case, we cannot bootstrap the data in advance (we
can't pre-fetch the titles and headers of all existing URLs).
It sounds to me that, if we want to use Samza, we will need a background
process that is synchronized with the main event loop of the task
(plus handling of back-pressure, so that no more than X requests can be
made simultaneously).


Regards,
Michael
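
The back-pressure idea described above (no more than X requests in flight)
could be sketched with a semaphore, roughly as follows. This is plain Java
under stated assumptions, not a Samza API: BoundedAsyncFetcher, submit,
drainCompleted and the fetch function are hypothetical names. How such a
design interacts with Samza's offset checkpointing is not covered here and
would need separate care.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

/** Hypothetical helper: background fetches with a cap on in-flight requests. */
final class BoundedAsyncFetcher {
    private final Semaphore permits;
    private final ExecutorService pool;
    private final Function<String, String> fetch; // assumed: URL -> fetched data
    private final ConcurrentLinkedQueue<String> completed = new ConcurrentLinkedQueue<>();

    BoundedAsyncFetcher(int maxInFlight, Function<String, String> fetch) {
        this.permits = new Semaphore(maxInFlight);
        this.pool = Executors.newFixedThreadPool(maxInFlight);
        this.fetch = fetch;
    }

    /** Called from the main loop; blocks while maxInFlight requests are pending. */
    void submit(String url) {
        permits.acquireUninterruptibly(); // back-pressure: wait for a free slot
        try {
            pool.execute(() -> {
                try {
                    completed.add(url + " -> " + fetch.apply(url));
                } catch (RuntimeException e) {
                    completed.add(url + " -> ERROR: " + e.getMessage());
                } finally {
                    permits.release(); // free the slot whether the fetch worked or not
                }
            });
        } catch (RuntimeException e) {
            permits.release(); // the task was rejected, so it will never release
            throw e;
        }
    }

    /** Called from the main loop; returns results finished so far without blocking. */
    List<String> drainCompleted() {
        List<String> out = new ArrayList<>();
        String result;
        while ((result = completed.poll()) != null) {
            out.add(result);
        }
        return out;
    }

    /** Waits (up to one minute) for in-flight requests, then stops the pool. */
    void close() {
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The main loop calls submit for each incoming URL and periodically calls
drainCompleted to pick up results, so only the main thread touches the task's
own state while the pool threads do the blocking I/O.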

On Mon, Sep 21, 2015 at 12:24 PM, Navina Ramesh <
nramesh@linkedin.com.invalid> wrote:

> Hi Michael,
> {quote}
> Do you mean that in such a case Samza should be combined with another
> Stream processing framework (such as Storm)?
> {quote}
> No. I didn't mean combining it with any other framework.
>
> {quote}
> "the job bootstraps the data from the source" - do you mean that
> you have a background process for this purpose or just listen to an
> additional stream of change log from some other framework?
> {quote}
> I didn't mean a background process. I meant just listening to a
> change-log stream from a data source.
>
> At LinkedIn, we use databus. The jobs configure databus (for a given
> data source) as one of the input streams for the job. Databus is a
> source-agnostic distributed change data capture system. You can find more
> information here <https://github.com/linkedin/databus>. The advantage is
> that the databus client is capable of "bootstrapping" from the source
> automatically and then switching to simply capturing changes from the data
> source. In this scenario, Samza doesn't do anything special, except that it
> will continue consuming from the databus stream while bootstrapping. Once
> bootstrap is complete, the job can start processing events from other input
> streams as well.
>
> I hope my explanation clarifies your question. :)
>
> Thanks!
> Navina

Re: Asynchronous approach and samza

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
Hi Michael,
{quote}
Do you mean that in such a case Samza should be combined with another
Stream processing framework (such as Storm)?
{quote}
No. I didn't mean combining it with any other framework.

{quote}
"the job bootstraps the data from the source" - do you mean that
you have a background process for this purpose or just listen to an
additional stream of change log from some other framework?
{quote}
I didn't mean a background process. I meant just listening to a
change-log stream from a data source.

At LinkedIn, we use databus. The jobs configure databus (for a given
data source) as one of the input streams for the job. Databus is a
source-agnostic distributed change data capture system. You can find more
information here <https://github.com/linkedin/databus>. The advantage is
that the databus client is capable of "bootstrapping" from the source
automatically and then switching to simply capturing changes from the data
source. In this scenario, Samza doesn't do anything special, except that it
will continue consuming from the databus stream while bootstrapping. Once
bootstrap is complete, the job can start processing events from other input
streams as well.

I hope my explanation clarifies your question. :)

Thanks!
Navina
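
The pattern described above (consume a change-log stream into a local buffer,
then answer lookups locally while processing the other input streams) can be
pictured with a small sketch. This is plain Java and uses neither Samza nor
databus APIs. LocalLookupTable and its methods are hypothetical names, and
treating a null value as a delete is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical local buffer that is kept up to date from change-capture events. */
final class LocalLookupTable {
    private final Map<String, String> store = new HashMap<>();

    /** Applies one change event; a null value is treated as a delete (assumption). */
    void applyChange(String key, String value) {
        if (value == null) {
            store.remove(key);
        } else {
            store.put(key, value);
        }
    }

    /** Local lookup used while processing the main input stream; no remote call. */
    Optional<String> lookup(String key) {
        return Optional.ofNullable(store.get(key));
    }
}
```

Because every lookup is answered from local memory, the processing thread
never blocks on a remote service. The trade-off is that the whole data set
must fit in the local buffer, which is why the approach depends on data scale.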


On Mon, Sep 21, 2015 at 1:56 AM, Michael Sklyar <mi...@gmail.com> wrote:

> Thank you for your replies,
>
> I understand that making an external blocking request in a single event
> thread will result in extremely low throughput. However, this can be solved
> by multithreading and/or an asynchronous approach. It is clear that in any
> case using external services can never achieve the throughput of simple
> transformations. However, most stream processing jobs need, from time to
> time, to query some external storage, web service, etc.
>
> Do you mean that in such a case Samza should be combined with another
> Stream processing framework (such as Storm)?
>
> Navina, "the job bootstraps the data from the source" - do you mean that
> you have a background process for this purpose or just listen to an
> additional stream of change log from some other framework?
>
> Thanks,
> Michael
>
> On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh
> <nramesh@linkedin.com.invalid> wrote:
>
> > Hi Michael,
> > I agree with what Yan said. While nothing stops you from doing it, it is
> > not encouraged as it affects throughput and realtime processing.
> >
> > {quote}
> > It seems that Samza's design suits "data transformation" scenarios very
> > well; what is not clear is how well it can support external services.
> > {quote}
> > We have some similar use-cases at LinkedIn where the Samza jobs need to
> > query external data sources. We do use a pattern where the job
> > bootstraps the data from the source using a change-capture system like
> > databus and buffers it locally, before processing from input streams.
> > Depending on the scale of your data, this model may or may not work for
> > you. However, there is no built-in support for this in Samza.
> >
> > Thanks!
> > Navina
> >
> > On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <ya...@gmail.com> wrote:
> >
> > > Hi Michael,
> > >
> > > Samza is designed for high-throughput and realtime processing. If you
> are
> > > using HTTP request/external service, you may not retrieve the same
> > > performance as not using it. However, technically speaking, there is
> > > nothing blocking you to do this, (well, discouraged anyway :). Samza by
> > > default does not provide this feature. So you maybe a little cautious
> > when
> > > implementing this.
> > >
> > > Thanks,
> > >
> > > Fang, Yan
> > > yanfang724@gmail.com
> > >
> > > On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mi...@gmail.com>
> > > wrote:
> > >
> > > > Hi,
> > > >
> > > > What would be the best approach for doing "blocking" operations in
> > Samza?
> > > >
> > > > For example, we have a kafka stream of urls for which we need to
> gather
> > > > external data via HTTP (such as alexa rank, get the page title and
> > > > headers..). Other scenarios include database access and decision
> making
> > > via
> > > > a rule engine.
> > > >
> > > > Samza processes messages in a singe thread, HTTP requests might take
> > > > hundreds of miliseconds. With the single threaded design the
> throughput
> > > > would be very limited, which can be solved with an asynchronous
> > approach.
> > > > However Samza documentation explicitely states
> > > > "*You are strongly discouraged from using threads in your job’s
> code*".
> > > >
> > > > It seems that Samza design suits very well "data transformation"
> > > scenarios,
> > > > what is not clear is how well can it support external services?
> > > >
> > > > Thanks,
> > > > Michael Sklyar
> > > >
> > >
> >
> >
> >
> > --
> > Navina R.
> >
>



-- 
Navina R.

Re: Asynchronous approach and samza

Posted by Michael Sklyar <mi...@gmail.com>.
Thank you for your replies,

I understand that making an external blocking request in a single event
thread will result in extremely low throughput. However, this can be solved
with multithreading and/or an asynchronous approach. Clearly, a job that
calls external services can never match the throughput of simple
transformations. Still, most stream processing jobs need, from time to
time, to query some external storage, web service, etc.
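
To make the multithreaded/asynchronous idea concrete, here is a minimal
sketch in plain Java. It does not use any Samza API: AsyncEnricher,
enrichAll and fakeFetch are hypothetical names, and the 100 ms sleep only
stands in for an HTTP call. The calling thread hands a batch of blocking
lookups to a fixed-size pool and waits for all of them, so the caller still
sees one synchronous step per batch. Whether this is appropriate inside a
Samza task is exactly the open question of this thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Function;

// Hypothetical helper, not part of Samza: runs blocking lookups for a batch
// of keys on a fixed-size pool and returns the results in input order.
class AsyncEnricher {
    static <K, V> List<V> enrichAll(List<K> batch, Function<K, V> blockingLookup, int poolSize) {
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        try {
            // Submit every lookup first so that they overlap in time.
            List<CompletableFuture<V>> futures = new ArrayList<>();
            for (K key : batch) {
                futures.add(CompletableFuture.supplyAsync(() -> blockingLookup.apply(key), pool));
            }
            // Then wait for each one; join() rethrows a failed lookup
            // wrapped in a CompletionException.
            List<V> results = new ArrayList<>();
            for (CompletableFuture<V> future : futures) {
                results.add(future.join());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Assumption: stand-in for an HTTP request taking roughly 100 ms.
        Function<String, String> fakeFetch = url -> {
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "title-of-" + url;
        };
        List<String> urls = List.of("a.example", "b.example", "c.example", "d.example");
        long start = System.nanoTime();
        List<String> titles = enrichAll(urls, fakeFetch, 4);
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println(titles + " in about " + elapsedMs + " ms");
    }
}
```

With four pool threads the four lookups overlap, so the batch should take
close to one lookup's latency rather than four. A real job would create the
pool once and reuse it instead of building one per batch, and would need
timeouts and error handling around the lookups.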

Do you mean that in such a case Samza should be combined with another
stream processing framework (such as Storm)?

Navina, "the job bootstraps the data from the source" - do you mean that
you have a background process for this purpose, or that you just listen to
an additional changelog stream from some other framework?

Thanks,
Michael

On Mon, Sep 21, 2015 at 6:52 AM, Navina Ramesh <nramesh@linkedin.com.invalid
> wrote:

> Hi Michael,
> I agree with what Yan said. While nothing stops you from doing it, it is
> not encouraged, as it affects throughput and real-time processing.
>
> {quote}
> It seems that Samza design suits very well "data transformation" scenarios,
> what is not clear is how well can it support external services?
> {quote}
> We have some similar use cases at LinkedIn where the Samza jobs need to
> query external data sources. We use a pattern where the job bootstraps the
> data from the source using a change-capture system like Databus and
> buffers it locally before processing from input streams.
> Depending on the scale of your data, this model may or may not work for
> you. However, there is no built-in support for this in Samza.
>
> Thanks!
> Navina
>

Re: Asynchronous approach and samza

Posted by Navina Ramesh <nr...@linkedin.com.INVALID>.
Hi Michael,
I agree with what Yan said. While nothing stops you from doing it, it is
not encouraged, as it affects throughput and real-time processing.

{quote}
It seems that Samza design suits very well "data transformation" scenarios,
what is not clear is how well can it support external services?
{quote}
We have some similar use cases at LinkedIn where the Samza jobs need to
query external data sources. We use a pattern where the job bootstraps the
data from the source using a change-capture system like Databus and
buffers it locally before processing from input streams.
Depending on the scale of your data, this model may or may not work for
you. However, there is no built-in support for this in Samza.
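
The pattern described above can be sketched roughly as follows, in plain
Java with no Samza or Databus API. LocalTableJoin, applyChange and enrich
are hypothetical names, and the in-memory map stands in for whatever local
buffer the job uses. Phase 1 replays change events into a local table;
phase 2 enriches input events with local lookups instead of remote calls.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Samza API: a local table built from a
// change-capture stream, then used to enrich input events locally.
class LocalTableJoin {
    private final Map<String, String> table = new HashMap<>();

    // Apply one change event. Assumption: a null value models a delete.
    void applyChange(String key, String value) {
        if (value == null) {
            table.remove(key);
        } else {
            table.put(key, value);
        }
    }

    // Enrich one input event using only the local table.
    String enrich(String url) {
        return url + " -> " + table.getOrDefault(url, "unknown");
    }

    public static void main(String[] args) {
        LocalTableJoin job = new LocalTableJoin();

        // Phase 1: bootstrap the local table from the change stream.
        job.applyChange("a.example", "rank=10");
        job.applyChange("b.example", "rank=20");
        job.applyChange("b.example", null); // the row was deleted upstream

        // Phase 2: process the input stream with local lookups only.
        for (String url : List.of("a.example", "b.example")) {
            System.out.println(job.enrich(url));
        }
        // prints:
        // a.example -> rank=10
        // b.example -> unknown
    }
}
```

The trade-off is that the whole reference data set (or the partition of it
that the task needs) has to fit in the local buffer, which is why the scale
of the data decides whether this model works.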

Thanks!
Navina

On Sun, Sep 20, 2015 at 7:55 PM, Yan Fang <ya...@gmail.com> wrote:

> Hi Michael,
>
> Samza is designed for high-throughput, real-time processing. If you are
> calling an HTTP endpoint or another external service, you may not get the
> same performance as you would without it. However, technically speaking,
> there is nothing stopping you from doing this (well, it is discouraged
> anyway :). Samza does not provide this feature by default, so you may want
> to be a little cautious when implementing it.
>
> Thanks,
>
> Fang, Yan
> yanfang724@gmail.com
>
>



-- 
Navina R.

Re: Asynchronous approach and samza

Posted by Yan Fang <ya...@gmail.com>.
Hi Michael,

Samza is designed for high-throughput, real-time processing. If you are
calling an HTTP endpoint or another external service, you may not get the
same performance as you would without it. However, technically speaking,
there is nothing stopping you from doing this (well, it is discouraged
anyway :). Samza does not provide this feature by default, so you may want
to be a little cautious when implementing it.

Thanks,

Fang, Yan
yanfang724@gmail.com

On Sun, Sep 20, 2015 at 4:28 PM, Michael Sklyar <mi...@gmail.com> wrote:

> Hi,
>
> What would be the best approach for doing "blocking" operations in Samza?
>
> For example, we have a Kafka stream of URLs for which we need to gather
> external data via HTTP (such as the Alexa rank, the page title and
> headers). Other scenarios include database access and decision making via
> a rule engine.
>
> Samza processes messages in a single thread, and HTTP requests might take
> hundreds of milliseconds. With the single-threaded design the throughput
> would be very limited, which can be solved with an asynchronous approach.
> However, the Samza documentation explicitly states
> "*You are strongly discouraged from using threads in your job’s code*".
>
> It seems that Samza design suits very well "data transformation" scenarios,
> what is not clear is how well can it support external services?
>
> Thanks,
> Michael Sklyar
>