Posted to issues@nifi.apache.org by "Robert Liszli (Jira)" <ji...@apache.org> on 2022/12/08 13:11:00 UTC
[jira] [Updated] (NIFI-10556) Create processor to support DeltaLake tables

     [ https://issues.apache.org/jira/browse/NIFI-10556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Robert Liszli updated NIFI-10556:
---------------------------------
    Description: 
*Plan for the new processor*

The new processor will use the Delta Standalone library to generate a Delta table for the Parquet data files. The processor is also capable of taking another processor's output file and uploading it to the data store.

*Processor inputs:*
 * The path of the Parquet files (a single directory), located on the local filesystem or in cloud storage (S3, GCP or Azure).
 * The structure of the Parquet file, in JSON format.
 * If the processor should handle another processor's output file, the names of the attributes that hold the output file's path and filename must be set.
 * Partition columns, separated by commas.
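
As an illustration of the last input, the comma-separated partition column property could be turned into a list along the following lines. This is only a sketch: the class name PartitionColumns and the choice to trim whitespace and to skip empty or repeated names are assumptions, not part of the plan above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of NiFi or Delta Standalone): parses the
// "partition columns" property value into an ordered list of column names.
final class PartitionColumns {

    private PartitionColumns() {
    }

    static List<String> parse(String raw) {
        List<String> columns = new ArrayList<>();
        if (raw == null) {
            return columns; // property not set: table is unpartitioned
        }
        for (String part : raw.split(",")) {
            String name = part.trim();
            // Assumption: blank entries and repeated names are ignored.
            if (!name.isEmpty() && !columns.contains(name)) {
                columns.add(name);
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        System.out.println(parse(" year, month ,,day "));
    }
}
```

Keeping the declared order matters because partition columns usually map to nested directory levels in that order.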

*Processor parameters:*
 * Dropdown selector for storage type selection.
 * Credentials for the selected storage type.

*On Trigger:*
 * If the processor is configured to handle another processor's output file, it first copies the new file to the desired data directory.
 * The processor compares the files in the data directory with the files already added to the Delta table. If a new data file exists, the processor adds it to the Delta table.
 * If no Delta table exists yet, the processor creates one.
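
The comparison step above amounts to a set difference between the files found in the data directory and the files already recorded in the Delta table. A minimal sketch of that decision, using only the Java standard library, might look like this. The class DeltaSyncPlan and its fields are hypothetical names introduced for illustration. How the two file sets are read from the storage system and from the Delta log through Delta Standalone is deliberately left out.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical planning helper: decides whether a table must be created
// and which data files still have to be added to it.
final class DeltaSyncPlan {
    final boolean createTable;          // true when no Delta table exists yet
    final SortedSet<String> filesToAdd; // data files missing from the table

    private DeltaSyncPlan(boolean createTable, SortedSet<String> filesToAdd) {
        this.createTable = createTable;
        this.filesToAdd = Collections.unmodifiableSortedSet(filesToAdd);
    }

    static DeltaSyncPlan plan(Set<String> dataDirectoryFiles,
                              boolean tableExists,
                              Set<String> filesInTable) {
        // Start from everything in the data directory (sorted for stable output).
        SortedSet<String> toAdd = new TreeSet<>(dataDirectoryFiles);
        if (tableExists) {
            // Drop the files the table already tracks; the rest are new.
            toAdd.removeAll(filesInTable);
        }
        return new DeltaSyncPlan(!tableExists, toAdd);
    }

    public static void main(String[] args) {
        Set<String> onDisk = new HashSet<>(Arrays.asList("a.parquet", "b.parquet"));
        Set<String> tracked = new HashSet<>(Arrays.asList("a.parquet"));
        DeltaSyncPlan p = plan(onDisk, true, tracked);
        System.out.println("create=" + p.createTable + " add=" + p.filesToAdd);
    }
}
```

When no table exists, every file in the data directory ends up in filesToAdd, which matches the case where the processor creates the table from scratch.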

*Output of the processor:*
 * An up-to-date Delta table in the chosen storage system.

 

Delta Standalone: [https://github.com/delta-io/connectors#delta-standalone]

  was:
*Plan for the new processor*

The new processor will use the Delta Standalone library to generate a Delta table for a set of Parquet data files located locally or in cloud storage.

*Processor inputs:*
 * The path of the Parquet files (a single directory), located on the local filesystem or in cloud storage (S3, GCP or Azure).
 * The structure of the Parquet file, in JSON format.

*Processor parameters:*
 * Dropdown selector for storage type selection.
 * Credentials for the selected storage type.

*On Trigger:*
 * The processor compares the files in the data directory with the files already added to the Delta table. If a new data file exists, the processor adds it to the Delta table.
 * If no Delta table exists yet, the processor creates one.

*Output of the processor:*
 * An up-to-date Delta table in the chosen storage system.

 

Delta Standalone: [https://github.com/delta-io/connectors#delta-standalone]


> Create processor to support DeltaLake tables
> --------------------------------------------
>
>                 Key: NIFI-10556
>                 URL: https://issues.apache.org/jira/browse/NIFI-10556
>             Project: Apache NiFi
>          Issue Type: New Feature
>          Components: Extensions
>            Reporter: Robert Liszli
>            Assignee: Robert Liszli
>            Priority: Major
>         Attachments: processor_usages.png
>
>          Time Spent: 1h
>  Remaining Estimate: 0h
>
> *Plan for the new processor*
> The new processor will use the Delta Standalone library to generate a Delta table for the Parquet data files. The processor is also capable of taking another processor's output file and uploading it to the data store.
> *Processor inputs:*
>  * The path of the Parquet files (a single directory), located on the local filesystem or in cloud storage (S3, GCP or Azure).
>  * The structure of the Parquet file, in JSON format.
>  * If the processor should handle another processor's output file, the names of the attributes that hold the output file's path and filename must be set.
>  * Partition columns, separated by commas.
> *Processor parameters:*
>  * Dropdown selector for storage type selection.
>  * Credentials for the selected storage type.
> *On Trigger:*
>  * If the processor is configured to handle another processor's output file, it first copies the new file to the desired data directory.
>  * The processor compares the files in the data directory with the files already added to the Delta table. If a new data file exists, the processor adds it to the Delta table.
>  * If no Delta table exists yet, the processor creates one.
> *Output of the processor:*
>  * An up-to-date Delta table in the chosen storage system.
>  
> Delta Standalone: [https://github.com/delta-io/connectors#delta-standalone]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)