Posted to users@nifi.apache.org by Jeff - Data Bean Australia <da...@gmail.com> on 2016/02/12 06:28:48 UTC

Thread Control of Processors

Hi

I have a use case like this:

There is a file that contains thousands of items, one per line. Each
item should trigger one GetHTTP processor to fetch some data.

Here is what I am trying to do:

1. Fetch this file
2. For each line, generate one file using SplitText
3. Drive GetHTTP downstream.

However, given there are more than 2000 lines, more than 2000 HTTP GET
processes will be created and will flood one web site, which doesn't
sound like a good idea. So I would like to control the processors so that
only a couple of them are running at the same time, and maybe add a delay
of a couple of seconds after each one finishes.

How can I do that in NiFi?

Thanks,
Jeff




-- 
Data Bean - A Big Data Solution Provider in Australia.
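
The behaviour asked for above (only a couple of fetches in flight at once, with a short pause after each) can be illustrated outside NiFi with a fixed-size worker pool. This is a plain-Java sketch, not NiFi code: the class BoundedFetchSketch, the method runAll, and the parameters maxConcurrent and pauseMillis are hypothetical names, and the HTTP call is replaced by a sleep.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class BoundedFetchSketch {

    /**
     * Runs one task per item on a pool of maxConcurrent threads and returns
     * the highest number of tasks that were observed running at once.
     */
    static int runAll(List<String> items, int maxConcurrent, long pauseMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(maxConcurrent);
        AtomicInteger active = new AtomicInteger();
        AtomicInteger peak = new AtomicInteger();

        for (String item : items) {
            pool.submit(() -> {
                int running = active.incrementAndGet();
                peak.accumulateAndGet(running, Math::max);
                try {
                    // Placeholder: the real HTTP GET for `item` would go here.
                    // The sleep stands in for the pause before this worker
                    // picks up its next item.
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                } finally {
                    active.decrementAndGet();
                }
            });
        }

        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return peak.get();
    }

    public static void main(String[] args) {
        List<String> items = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            items.add("item-" + i);
        }
        System.out.println("peak concurrency: " + runAll(items, 2, 50));
    }
}
```

Because the pool has only maxConcurrent threads, the extra items simply wait in the pool's queue. Inside NiFi the comparable knobs would presumably be a processor's "Concurrent Tasks" and "Run schedule" settings rather than hand-written threads.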

Re: Thread Control of Processors

Posted by Jeff - Data Bean Australia <da...@gmail.com>.
I just realised that GetHTTP doesn't accept input:

@Tags({"get", "fetch", "poll", "http", "https", "ingest", "source", "input"})
@InputRequirement(Requirement.INPUT_FORBIDDEN)
@CapabilityDescription("Fetches a file via HTTP")
@WritesAttributes({
    @WritesAttribute(attribute = "filename", description = "The filename is set to the name of the file on the remote server"),
    @WritesAttribute(attribute = "mime.type", description = "The MIME Type of the FlowFile, as reported by the HTTP Content-Type header")
})
public class GetHTTP extends AbstractSessionFactoryProcessor

Given the INPUT_FORBIDDEN requirement, GetHTTP is not suitable for this use
case. I can only use InvokeHTTP alone or a combination of ControlRate and
InvokeHTTP.



On Sat, Feb 13, 2016 at 11:38 AM, Jeff - Data Bean Australia <
databean.au@gmail.com> wrote:

> Thank you both, Joe and Simon, for pointing out InvokeHTTP and ControlRate
> to me.
>
> Regarding InvokeHTTP, I have a couple of questions for you.
>
> Both InvokeHTTP and GetHTTP have settings such as "Concurrent Tasks" and
> "Run schedule". If my use case only involves the GET method, why is
> InvokeHTTP better? I noticed that InvokeHTTP inherits from
> AbstractProcessor, while AbstractProcessor and GetHTTP share the same
> parent, AbstractSessionFactoryProcessor. Can you explain what enhancement
> InvokeHTTP gets by going one level further down the hierarchy?
>
> ControlRate and InvokeHTTP are at the same level in the class
> hierarchy. Simon pointed out that ControlRate and InvokeHTTP can work
> together for more intuitive control. This looks wonderful. Both of you
> mentioned backpressure. What is the difference, regarding backpressure,
> between using InvokeHTTP alone and combining it with ControlRate?
>
> Can ControlRate work with GetHTTP as well? If yes, what would be the
> difference?
>
> Thanks,
> Jeff

Re: Thread Control of Processors

Posted by Jeff - Data Bean Australia <da...@gmail.com>.
Thank you both, Joe and Simon, for pointing out InvokeHTTP and ControlRate
to me.

Regarding InvokeHTTP, I have a couple of questions for you.

Both InvokeHTTP and GetHTTP have settings such as "Concurrent Tasks" and
"Run schedule". If my use case only involves the GET method, why is
InvokeHTTP better? I noticed that InvokeHTTP inherits from
AbstractProcessor, while AbstractProcessor and GetHTTP share the same
parent, AbstractSessionFactoryProcessor. Can you explain what enhancement
InvokeHTTP gets by going one level further down the hierarchy?

ControlRate and InvokeHTTP are at the same level in the class hierarchy.
Simon pointed out that ControlRate and InvokeHTTP can work together for
more intuitive control. This looks wonderful. Both of you mentioned
backpressure. What is the difference, regarding backpressure, between
using InvokeHTTP alone and combining it with ControlRate?

Can ControlRate work with GetHTTP as well? If yes, what would be the
difference?

Thanks,
Jeff

On Fri, Feb 12, 2016 at 7:06 PM, Simon Ball <sb...@hortonworks.com> wrote:

> Jeff,
>
> Another approach I've used with some success is the ControlRate processor
> before the InvokeHttp, which gives you an intuitive way of limiting the
> number of requests in a specified time interval. Note however that this
> does need to be combined with back pressure control to prevent requests
> queuing behind the InvokeHttp. Feeding failure retries back into the
> ControlRate, or a funnel before the invoke also gives you a bit more
> control over groupings of retries for example.
>
> Simon

Re: Thread Control of Processors

Posted by Simon Ball <sb...@hortonworks.com>.
Jeff,

Another approach I've used with some success is to put the ControlRate
processor before the InvokeHTTP, which gives you an intuitive way of
limiting the number of requests in a specified time interval. Note, however,
that this does need to be combined with back pressure control to prevent
requests queuing behind the InvokeHTTP. Feeding failure retries back into
the ControlRate, or into a funnel before the InvokeHTTP, also gives you a
bit more control over groupings of retries, for example.

Simon

Sent from my iPhone

-
Simon Elliston Ball
Solutions Engineer - EMEA
+44 7930 424111
Hortonworks - We Do Hadoop
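
The idea of limiting the number of requests in a specified time interval can be sketched as a simple fixed-window counter. This is only an illustration of the general technique, not the ControlRate implementation: the class WindowRateLimiter, its method tryAcquire, and the parameters maxPerWindow and windowMillis are hypothetical names, and the caller passes in the current time so the behaviour is easy to trace.

```java
final class WindowRateLimiter {
    private final int maxPerWindow;   // hypothetical: requests allowed per window
    private final long windowMillis;  // hypothetical: window length in milliseconds
    private long windowStart;
    private int used;
    private boolean started;

    WindowRateLimiter(int maxPerWindow, long windowMillis) {
        if (maxPerWindow < 1 || windowMillis < 1) {
            throw new IllegalArgumentException("limits must be positive");
        }
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    /**
     * Returns true if a request may be sent at nowMillis, or false if it
     * should stay queued until the current window has elapsed.
     */
    synchronized boolean tryAcquire(long nowMillis) {
        // Open a new window on the first call or once the old one has expired.
        if (!started || nowMillis - windowStart >= windowMillis) {
            started = true;
            windowStart = nowMillis;
            used = 0;
        }
        if (used < maxPerWindow) {
            used++;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // Allow at most 2 requests in any 1000 ms window.
        WindowRateLimiter limiter = new WindowRateLimiter(2, 1000);
        long[] arrivals = {0, 10, 20, 999, 1000, 1500, 1600};
        for (long t : arrivals) {
            System.out.println(t + " ms -> " + (limiter.tryAcquire(t) ? "send" : "hold"));
        }
    }
}
```

Requests that get "hold" are the ones that would sit in a queue, which is why the rate limit needs to be paired with back pressure on that queue.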


On 12 Feb 2016, at 05:34, Joe Witt <jo...@gmail.com> wrote:

> Jeff,
>
> This is definitely a strong use case for NiFi.
>
> It might be that InvokeHTTP is the better choice here.
>
> If what you'd like to do is effectively throttle the rate at which you
> hit the web service with the InvokeHTTP calls, you can schedule that
> processor to run as often as you like (for example every 100 ms).
> Then use backpressure settings on the queues feeding that InvokeHTTP
> processor. Effectively, you can control where data will back up in the
> flow while it is being throttled.
>
> If the lookup data is a good candidate for caching then there may be
> other great options to make this more efficient.
>
> Perhaps you can share a flow template of what you have so far and we
> can make recommendations on next steps?
>
> Thanks
> Joe



Re: Thread Control of Processors

Posted by Joe Witt <jo...@gmail.com>.
Jeff,

This is definitely a strong use case for NiFi.

It might be that InvokeHTTP is the better choice here.

If what you'd like to do is effectively throttle the rate at which you
hit the web service with the InvokeHTTP calls, you can schedule that
processor to run as often as you like (for example every 100 ms).
Then use backpressure settings on the queues feeding that InvokeHTTP
processor. Effectively, you can control where data will back up in the
flow while it is being throttled.

If the lookup data is a good candidate for caching then there may be
other great options to make this more efficient.

Perhaps you can share a flow template of what you have so far and we
can make recommendations on next steps?

Thanks
Joe
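
The scheduling-plus-backpressure idea above can be walked through with a small tick-based simulation. It is a simplified model under stated assumptions, not NiFi code: one tick stands for one scheduled run (100 ms in the example), each run sends exactly one request, and the upstream step hands over items one at a time only while the queue is below its threshold. The names BackpressureSketch, simulate, and queueThreshold are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BackpressureSketch {

    /** Outcome of one simulated run. */
    static final class Result {
        final int ticks;            // scheduled runs needed to send everything
        final int maxQueued;        // deepest the throttled queue ever got
        final int maxHeldUpstream;  // most items held back before that queue

        Result(int ticks, int maxQueued, int maxHeldUpstream) {
            this.ticks = ticks;
            this.maxQueued = maxQueued;
            this.maxHeldUpstream = maxHeldUpstream;
        }
    }

    static Result simulate(int totalItems, int queueThreshold) {
        if (totalItems < 0 || queueThreshold < 1) {
            throw new IllegalArgumentException("need totalItems >= 0 and queueThreshold >= 1");
        }
        Deque<Integer> queue = new ArrayDeque<>();
        int upstream = totalItems; // items not yet handed to the throttled queue
        int sent = 0;
        int ticks = 0;
        int maxQueued = 0;
        int maxHeldUpstream = 0;

        while (sent < totalItems) {
            // Upstream may only add items while the queue is below the threshold.
            while (upstream > 0 && queue.size() < queueThreshold) {
                queue.addLast(totalItems - upstream);
                upstream--;
            }
            maxQueued = Math.max(maxQueued, queue.size());
            maxHeldUpstream = Math.max(maxHeldUpstream, upstream);

            // The throttled step sends exactly one request per scheduled run.
            queue.pollFirst();
            sent++;
            ticks++;
        }
        return new Result(ticks, maxQueued, maxHeldUpstream);
    }

    public static void main(String[] args) {
        long runScheduleMillis = 100; // the "every 100 ms" example from the message
        Result r = simulate(2000, 100);
        long estimatedMillis = (long) r.ticks * runScheduleMillis;
        System.out.println("runs needed:       " + r.ticks);
        System.out.println("max queued:        " + r.maxQueued);
        System.out.println("max held upstream: " + r.maxHeldUpstream);
        System.out.println("estimated time:    " + estimatedMillis / 1000 + " s");
    }
}
```

In this model the threshold does not change how long the work takes; it only decides whether waiting items sit in the queue right before the HTTP step or further upstream.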
