
Posted to dev@nifi.apache.org by yogesh sharma <yo...@LIVE.COM> on 2016/08/10 10:31:26 UTC

Facing issue with Duplicate data

Hello Team,


I am new to Apache NiFi and have just started working with it. We currently have NiFi installed as a three-node cluster.


I am getting duplicate data while implementing the use case below.


Use case: I need to fetch data from the US earthquake feed (http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_hour.geojson) and load it incrementally.


For that I am using the processors below:

* InvokeHTTP
* SplitJson
* EvaluateJsonPath
* ReplaceText
* MergeContent
* PutFile/PutHDFS

I have attached my template as well.
The issue I am facing is duplicate data: every time InvokeHTTP hits the API it gets all the currently available details, which may include data that was already fetched, so the same data is loaded into the target again.

I need to load only unique data into the target. I found DetectDuplicate but do not know how to configure it. Can you tell me how to configure its services in a cluster? If you have any other solution, please let me know. We want to use NiFi in our upcoming project but are facing issues while implementing small POCs.


Thanks

Yogesh (+91-9689942310)


Re: Facing issue with Duplicate data

Posted by Joe Witt <jo...@gmail.com>.

Hello Yogesh,

A couple of quick pointers that might help.
1) Set InvokeHTTP to run only on the primary node. You don't want all
nodes in the cluster pulling the data.
2) After SplitJson, use site-to-site to send the splits across the
cluster to parallelize the work.  That sounds like overkill for what
you've described, but it will certainly scale and get the point across.
3) You're pulling from an endpoint that does not offer queuing
semantics, so of course duplicates are a thing to consider.  Given that
it appears to be an hourly dataset, I would add DetectDuplicate right
after the HTTP pull of JSON, and I'd schedule the pull to happen every
10, 15, or 30 minutes or so.  Take a look at the docs for setting up
duplicate detection.
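
To picture what duplicate detection does across overlapping polls, here is
a minimal Python sketch. It is not NiFi code and not the DetectDuplicate
implementation. It assumes each GeoJSON feature carries a unique "id"
field to use as the key; the function name filter_new_features and the
sample events are made up for illustration. In a cluster, the "seen" set
would have to live in a cache shared by all nodes rather than in local
memory.

```python
import json


def filter_new_features(geojson_text, seen_ids):
    """Return only the features whose id has not been seen before.

    seen_ids is a set that is updated in place. It stands in for the
    shared cache a clustered flow would need (an assumption of this sketch).
    """
    collection = json.loads(geojson_text)
    fresh = []
    for feature in collection.get("features", []):
        event_id = feature.get("id")
        if event_id is not None:
            if event_id in seen_ids:
                continue  # already loaded by an earlier poll
            seen_ids.add(event_id)
        # Features without an id cannot be compared, so they pass through.
        fresh.append(feature)
    return fresh


# Two overlapping polls of an hourly feed (made-up sample data).
poll_1 = json.dumps({"type": "FeatureCollection", "features": [
    {"id": "ev1", "properties": {"mag": 1.2}},
    {"id": "ev2", "properties": {"mag": 2.5}},
]})
poll_2 = json.dumps({"type": "FeatureCollection", "features": [
    {"id": "ev2", "properties": {"mag": 2.5}},
    {"id": "ev3", "properties": {"mag": 0.9}},
]})

seen = set()
first = [f["id"] for f in filter_new_features(poll_1, seen)]
second = [f["id"] for f in filter_new_features(poll_2, seen)]
print(first, second)
```

The second poll overlaps the first, so only the event that was not seen
before is passed on to the target.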

The pattern you've laid out makes sense, is quite straightforward, and
is common.

Thanks
Joe
