Posted to users@nifi.apache.org by Dweep Sharma <dw...@redbus.com> on 2019/07/05 08:24:36 UTC

Kafka to parquet to s3

Hi,

I have been trying to move some JSON data to S3 in Parquet format.

From Kafka to S3 is straightforward, but I cannot seem to find the right
processor to convert JSON to Parquet and move it to S3.

PutParquet does not take an S3 bucket or credentials, and it requires Hadoop
to be installed.

Can someone please share a blog or the steps to achieve this? Thanks in advance.

-Dweep


RE: Kafka to parquet to s3

Posted by "Williams, Jim" <jw...@alertlogic.com>.
Dweep,

 

The data I am moving into S3 already consists of some fairly large sets of files, since they are a bulk export from a SaaS application.  Thus, the number of files being PUT to S3 was not a huge consideration.  However, since the Parquet files are to be consumed by Redshift Spectrum, I had an interest in consolidating flow files containing like objects into a single flow file prior to Parquet conversion.  I used the MergeRecord processor [1] to do this.

 

So, to amplify on the flow, it really looks more like this:

 

(Get stuff in JSON format) --> ConvertRecord --> MergeRecord --> ConvertAvroToParquet --> PutS3

 

 

This is not really a “real-time streaming” flow; it’s more batch-oriented.  There is a delay in the flow (which is acceptable to us) while the MergeRecord processor collects and merges possibly several flow files into a bigger flow file.
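
For reference, a MergeRecord configuration along these lines is the kind of thing to tune (the property names are from the 1.9.2 docs; the values here are only illustrative, so adjust them to your own volumes and latency tolerance):

    Merge Strategy            = Bin-Packing Algorithm
    Minimum Number of Records = 10000
    Maximum Number of Records = 100000
    Minimum Bin Size          = 64 MB
    Maximum Bin Size          = 256 MB
    Max Bin Age               = 5 min

Max Bin Age is what bounds the delay mentioned above: a bin is flushed once it reaches that age, even if the minimums have not yet been met.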

 

 

[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.MergeRecord/index.html

 

 

 

Warm regards,

 


Jim Williams | Principal Database Developer

O: +1 713.341.7812 | C: +1 919.523.8767 | jwilliams@alertlogic.com | alertlogic.com

Re: Kafka to parquet to s3

Posted by Dweep Sharma <dw...@redbus.com>.
Thanks, Jim, for the insights on the advantages; this worked for me as well.

Any thoughts on partitioning and file size so the S3 PUT costs are not too
high?

I do not see options on ConvertAvroToParquet for this.

-Dweep


RE: Kafka to parquet to s3

Posted by "Williams, Jim" <jw...@alertlogic.com>.
Dweep,

 

I have been working on a project where Parquet files are being written to S3.  I’ve had the liberty to use the most up-to-date version of NiFi, so I have implemented this on 1.9.2.

 

The approach I have taken is something like this:

 

(Get stuff in JSON format) --> ConvertRecord --> ConvertAvroToParquet --> PutS3

 

The ConvertRecord [1] processor changes the flow files from JSON to Avro.  Although it is possible to use schema inference with this processor, it is something we have not leveraged yet.  The ConvertAvroToParquet [2] processor converts the flow file but does not write it out to a local or HDFS file system like the PutParquet [3] processor would.
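
As a concrete sketch of the ConvertRecord setup (JsonTreeReader and AvroRecordSetWriter are the stock 1.9.2 controller services; the schema text below is only an illustration, not our actual schema):

    ConvertRecord
        Record Reader = JsonTreeReader
        Record Writer = AvroRecordSetWriter

    JsonTreeReader
        Schema Access Strategy = Use 'Schema Text' Property
        Schema Text            = (an explicit Avro schema, e.g. the one below)

    AvroRecordSetWriter
        Schema Write Strategy  = Embed Avro Schema
        Schema Access Strategy = Use 'Schema Text' Property
        Schema Text            = (the same schema)

    {
      "type": "record",
      "name": "ExampleEvent",
      "namespace": "com.example",
      "fields": [
        { "name": "id",      "type": "string" },
        { "name": "ts",      "type": "long" },
        { "name": "payload", "type": ["null", "string"], "default": null }
      ]
    }

Embedding the Avro schema in the output matters here, since ConvertAvroToParquet reads the schema from the incoming Avro content.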

 

Implementing the flow in this way gives a couple of advantages:

 

1.	We do not need to use the PutParquet processor

a.	Extra configuration on cluster nodes is avoided for writing directly to S3 with this processor
b.	Writing to a local or HDFS filesystem and then copying to S3 is avoided

2.	We can use the native authentication methods which come with the S3 processor

a.	Roles associated with EC2 instances are leveraged, which makes cluster deployment much simpler

 

We have been happy using this pattern for the past couple of months.  I am watching for progress on NIFI-6089 [4] for a Parquet record reader/writer with interest.

 

 

 

[1] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.9.2/org.apache.nifi.processors.standard.ConvertRecord/index.html

[2] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.ConvertAvroToParquet/index.html

[3] - https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-parquet-nar/1.9.2/org.apache.nifi.processors.parquet.PutParquet/index.html

[4] - https://issues.apache.org/jira/browse/NIFI-6089

 

 

Warm regards,

 


Jim Williams | Principal Database Developer

O: +1 713.341.7812 | C: +1 919.523.8767 | jwilliams@alertlogic.com | alertlogic.com


Re: Kafka to parquet to s3

Posted by Bryan Bende <bb...@gmail.com>.
Currently, PutParquet and FetchParquet are tied to the Hadoop API, so they need
the config files. As mentioned, you can create a core-site.xml with a local
file system, and then you could use another part of the flow to pick up the
file using ListFile -> FetchFile -> PutS3Object.
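
A minimal sketch of that pick-up leg (the property names are from the stock 1.9.2 processors; the directory, bucket, and region are placeholders):

    ListFile
        Input Directory = /reservoir-dl          (wherever PutParquet writes)
    FetchFile
        File to Fetch   = ${absolute.path}/${filename}
    PutS3Object
        Bucket          = my-bucket
        Object Key      = parquet/${filename}
        Region          = us-east-1

Note that ListFile keeps state, so each Parquet file is listed (and therefore uploaded) only once.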

There is a way to write to S3 directly from PutParquet and PutHDFS, but it
requires additional jars and config, and is honestly harder to set up than
just using the above approach.

There is also a JIRA to implement a Parquet record reader and writer, which
would then let you use ConvertRecord to go from JSON to Parquet, and then
straight to PutS3Object.

I think the error mentioned means you have the same field name at different
levels in your JSON, and that is not allowed in an Avro schema.
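
For illustration (the field names here are made up to mirror the reported error), JSON shaped like this trips that rule, because schema inference creates a nested record type named "addresstype" for each object, and Avro does not allow the same type name to be defined twice in one schema:

    {
      "home":   { "addresstype": { "code": 1 } },
      "office": { "addresstype": { "code": 2 } }
    }

Supplying an explicit schema that gives the two nested records distinct names (or renaming one of the fields) avoids the redefinition.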


Re: Kafka to parquet to s3

Posted by Dweep Sharma <dw...@redbus.com>.
Thanks Shanker,

But I do not see options in PutParquet for an S3 bucket/credentials. I am
assuming I need to push this to a local store and then add the PutS3Object
processor on top of that?

Also, as the record reader in PutParquet I am using JsonTreeReader with
defaults (Infer Schema), and I get the error "Failed to write due to can't
redefine: org.apache.nifi.addresstype".

Some files, however, do get written. Are the default settings good, or am I
missing something?

-Dweep




Re: Kafka to parquet to s3

Posted by Andrew Grande <ap...@gmail.com>.
Interestingly enough, the ORC processor in NiFi can just use defaults if
Hadoop configs aren't provided; no additional config steps are required. Is
that something that could be improved for PutParquet, maybe?

Andrew




Re: Kafka to parquet to s3

Posted by Shanker Sneh <sh...@zoomcar.com>.
Hello Dweep,

In the PutParquet processor you can set the '*Hadoop Configuration
Resources*' property to a *core-site.xml* file whose content can be somewhat
like the below:

<configuration>

    <property>

        <name>fs.defaultFS</name>

        <value>file:///reservoir-dl</value>

    </property>

</configuration>

Here, file:///reservoir-dl could be the path where in-transit Parquet
files are written -- before being pushed to S3.
More importantly, you *do not* need Hadoop to be installed. You can just
place the core-site.xml file on your NiFi nodes and get started.
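
With that core-site.xml in place, PutParquet's 'Hadoop Configuration
Resources' and 'Directory' properties just need local paths, for example
(both paths here are only illustrative):

    Hadoop Configuration Resources = /opt/nifi/conf/core-site.xml
    Directory                      = /reservoir-dl/landing

and the processor writes plain Parquet files to the local disk, with no HDFS
cluster involved.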





-- 
Best,
Sneh