Posted to users@nifi.apache.org by James Srinivasan <ja...@gmail.com> on 2018/11/16 16:10:38 UTC

Generating Remote URL for InvokeHTTP

Hi all,

I'm observing some slightly unusual behaviour with my flow and wanted
to run a possible explanation past the list. I'm using NiFi to scrape
a website whose data is laid out as a nested hierarchy of pages,

e.g. GET http://server/2018/16/11/  returns a webpage full of links to
today's data

I'm using a combination of InvokeHTTP (to traverse the hierarchy) and
GetHTMLElement (to extract file and directory links), starting at the
root i.e. http://server/, then walking the years, months, days etc.

I'm generating the Remote URLs as

${invokehttp.request.url}${HTMLElement}

where invokehttp.request.url is the URL previously fetched for the day
listing in the hierarchy, and HTMLElement is the link to the file
extracted by GetHTMLElement.

Finally, I've routed "retry" and "failure" back to the InvokeHTTP
processor since my network is quite flaky.

Mostly everything is ok, but sometimes I manage to generate URLs which
look a bit like this:

http://server/2018/16/11/filename.jsonfilename.json

i.e. the filename part of the URL is duplicated.

My theory is that this happens when there is a network issue: the
flowfile is routed to retry, the InvokeHTTP processor re-evaluates
the expression for the Remote URL, and the filename is appended a
second time (since invokehttp.request.url will already have been
updated by the failed request).
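
A minimal sketch of this suspected mechanism, written in Python as a
toy model rather than real NiFi code. The function names are
hypothetical, and the write-back of invokehttp.request.url on a failed
request is the assumption being tested, not confirmed behaviour:

```python
def evaluate_remote_url(attributes):
    # Toy stand-in for the expression ${invokehttp.request.url}${HTMLElement}
    return attributes["invokehttp.request.url"] + attributes["HTMLElement"]


def invoke_http(attributes):
    # The Remote URL expression is evaluated every time the flowfile
    # reaches the processor, including after a retry.
    url = evaluate_remote_url(attributes)
    # ASSUMPTION: the request URL is written back to the flowfile even
    # when the request itself fails.
    updated = dict(attributes)
    updated["invokehttp.request.url"] = url
    return url, updated


flowfile = {
    "invokehttp.request.url": "http://server/2018/16/11/",
    "HTMLElement": "filename.json",
}

# First attempt (imagine it fails and is routed to "retry")
first_url, flowfile = invoke_http(flowfile)
# Second attempt re-evaluates the expression against the updated attribute
retry_url, flowfile = invoke_http(flowfile)

print(first_url)
print(retry_url)
```

Under that assumption the second evaluation appends the filename to a
URL that already ends in it, which matches the malformed URL above.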

Does this sound feasible? My proposed fix for my flow is to build the
URL once, storing it in a single attribute with UpdateAttribute before
InvokeHTTP, so that any retries don't munge the URL.
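
The proposed fix can be sketched with the same kind of toy Python
model (again not real NiFi code). The attribute name target.url and
the function names are hypothetical, and the sketch assumes that
retries loop back to InvokeHTTP only, not to UpdateAttribute:

```python
def update_attribute(attributes):
    # Runs once, before InvokeHTTP: freeze the full URL in its own attribute.
    updated = dict(attributes)
    updated["target.url"] = (
        attributes["invokehttp.request.url"] + attributes["HTMLElement"]
    )
    return updated


def invoke_http(attributes):
    # Remote URL is now just ${target.url}, so re-evaluation is harmless.
    url = attributes["target.url"]
    # ASSUMPTION: the request URL is still written back on every attempt.
    updated = dict(attributes)
    updated["invokehttp.request.url"] = url
    return url, updated


flowfile = update_attribute({
    "invokehttp.request.url": "http://server/2018/16/11/",
    "HTMLElement": "filename.json",
})

# Three attempts: two simulated failures followed by a success
urls = []
for _ in range(3):
    url, flowfile = invoke_http(flowfile)
    urls.append(url)

print(urls)
```

Because the retry path never re-runs the concatenation, every attempt
in this model requests the same URL.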

Many thanks, hope this makes sense.

James