Posted to issues@nifi.apache.org by "ASF subversion and git services (Jira)" <ji...@apache.org> on 2020/11/17 23:41:00 UTC

[jira] [Commented] (NIFI-7989) Add Hive "data drift" processor

    [ https://issues.apache.org/jira/browse/NIFI-7989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17234089#comment-17234089 ] 

ASF subversion and git services commented on NIFI-7989:
-------------------------------------------------------

Commit edc060bd92b689c4d610f5ac4aef83073167c8a6 in nifi's branch refs/heads/main from Matt Burgess
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=edc060b ]

NIFI-7989: Add UpdateHiveTable processors for data drift capability

NIFI-7989: Allow for optional blank line after optional column and partition headers
NIFI-7989: Incorporated review comments
NIFI-7989: Close Statement when finishing processing
NIFI-7989: Remove database name property, update output table attribute

This closes #4653.

Signed-off-by: Peter Turcsanyi <tu...@apache.org>


> Add Hive "data drift" processor
> -------------------------------
>
>                 Key: NIFI-7989
>                 URL: https://issues.apache.org/jira/browse/NIFI-7989
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Matt Burgess
>            Assignee: Matt Burgess
>            Priority: Major
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> It would be nice to have a Hive processor (one for each Hive NAR) that could check an incoming record-based flowfile against a destination table, add columns and/or partition values as needed, or even create the table if it does not exist. Such a processor could be used in a flow where the incoming data's schema can change and we want to be able to write the data to a Hive table, preferably by using PutHDFS, PutParquet, or PutORC to place it directly where it can be queried.
> Such a processor should be able to use a HiveConnectionPool to execute any DDL (e.g. ALTER TABLE ... ADD COLUMNS) necessary to make the table match the incoming data. Partition values could be provided via a property that supports Expression Language; in that case, an ALTER TABLE would be issued to add the partition directory.
> Whether the table is created or updated, and whether or not there are partition values to consider, an attribute should be written to the outgoing flowfile corresponding to the location of the table (and any associated partitions). This supports a flow that updates a Hive table based on the incoming data and then lets the user put the flowfile directly into the destination location (e.g. with PutHDFS), instead of having to load it using HiveQL or being subject to the restrictions of Hive Streaming tables (ORC-backed, transactional, etc.). An illustrative DDL sketch follows below.
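
Below is a minimal HiveQL sketch of the kind of DDL such a processor might issue; the table, column, and partition names are hypothetical examples for illustration, not taken from the actual UpdateHiveTable implementation.

    -- Create the table if it does not exist, matching the incoming record schema
    CREATE TABLE IF NOT EXISTS weblogs (
        user_id    STRING,
        event_time TIMESTAMP
    )
    PARTITIONED BY (event_date STRING)
    STORED AS ORC;

    -- Add any column present in the incoming data but missing from the table ("data drift")
    ALTER TABLE weblogs ADD COLUMNS (user_agent STRING);

    -- Register the partition directory so files placed there (e.g. by PutHDFS or PutORC)
    -- become queryable without a separate load step
    ALTER TABLE weblogs ADD IF NOT EXISTS PARTITION (event_date = '2020-11-17');

The resulting table (and partition) location would then be written to a flowfile attribute so a downstream PutHDFS/PutORC processor can target it directly.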



--
This message was sent by Atlassian Jira
(v8.3.4#803005)