Posted to users@nifi.apache.org by Erik Anderson <ea...@pobox.com> on 2019/05/24 14:17:13 UTC

Advice for proper workflow for very large datasets->HDFS/HIVE

Take a look at the attached diagram. I hope it helps to clarify the question.

I am trying to use NiFi to move VERY large datasets from Oracle/Postgres/etc... to a data lake.

When you have an amazing tool like NiFi, everything looks like a NiFi problem. :)

We have many large datasets, many in the TB range. Some update every 15 minutes; some update daily, weekly, or monthly.

For now, let's discount the 15-minute dataset, as it's time series + geospatial data.

Am I right in assuming these 2 flows will work?

QueryDatabaseTable -> PutHDFS
or
GenerateTableFetch->ExecuteSQL->PutHDFS
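
For reference, my mental model of the second flow is that GenerateTableFetch only generates paged SELECT statements keyed on a maximum-value column, and ExecuteSQL (which can run on multiple nodes) does the actual pulling before PutHDFS lands the results. A rough Python sketch of the kind of paging I mean is below; the table name, column name, and last-seen value are placeholders, and the exact SQL NiFi emits depends on the database adapter:

# Illustrative only: roughly the paged queries GenerateTableFetch would hand to ExecuteSQL.
# The table, maximum-value column, and last-seen max value are placeholders.
def generate_table_fetch(table, maxvalue_col, last_max, new_rows, partition_size=100_000):
    """Yield one SELECT per partition of rows newer than last_max."""
    pages = -(-new_rows // partition_size)  # ceiling division
    for page in range(pages):
        offset = page * partition_size
        yield (
            f"SELECT * FROM {table} "
            f"WHERE {maxvalue_col} > {last_max} "
            f"ORDER BY {maxvalue_col} "
            f"LIMIT {partition_size} OFFSET {offset}"
        )

# Example: 1,000,000 new rows since the last run -> 10 queries that
# separate ExecuteSQL instances could run in parallel.
for sql in generate_table_fetch("orders", "updated_at_epoch", 1558700000, 1_000_000):
    print(sql)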

Ultimately this will be exposed via Hive. Would PutHiveQL or PutHiveStreaming be a better fit?
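
The alternative I keep coming back to is to keep landing files with PutHDFS (ExecuteSQL emits Avro by default) and simply define a Hive external table over the landing directory, rather than pushing rows through PutHiveQL/PutHiveStreaming. A minimal sketch of that, assuming PyHive, with the host, database, columns, and path all being placeholders:

# Sketch only: expose the PutHDFS landing directory to Hive as an external table.
# Host, database, table, columns, and location are placeholders; assumes the
# flow lands ExecuteSQL's default Avro output.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, username="etl")
cur = conn.cursor()
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS lake.orders_raw (
        order_id BIGINT,
        updated_at_epoch BIGINT,
        payload STRING
    )
    STORED AS AVRO
    LOCATION '/data/lake/orders/'
""")
cur.close()
conn.close()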

Is there a good article out there about tackling small and large datasets with various periodicities? Even better would be a ready-made flow template, with settings, that I could download and tweak. Ideally a flow template that I can reuse across all datasets and database technologies.

I am sure this isn't a new problem and someone out there has tackled it.

Thx,

Erik Anderson
Bloomberg