Posted to users@nifi.apache.org by Carlos Paradis <cv...@hawaii.edu> on 2017/02/13 23:13:09 UTC

Integration between Apache NiFi and Parquet or Workaround?

Hi,

Our group has recently started trying to prototype a
Hadoop+Spark+NiFi+Parquet setup, and I have been having trouble finding any
documentation other than a sparse discussion on using Kite as a workaround
to integrate NiFi and Parquet.

Are there any future plans for this integration in NiFi, or could anyone
give me some insight into the scenarios in which this workaround would (or
would not) be worthwhile, and what the alternatives are?

The most recent discussion
<http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
I found in this list dates from May 11, 2016. I also saw some interest in
doing this on Stackoverflow here
<http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
and here
<http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>
.

Thanks,

-- 
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>

Re: Integration between Apache NiFi and Parquet or Workaround?

Posted by Bryan Bende <bb...@gmail.com>.
I'll caveat this by saying that up until 10 mins ago I had never
looked at Parquet, so I could be completely wrong, but...

The Parquet API seems heavily geared towards HDFS. For example, take
the AvroParquetWriter:

https://github.com/Parquet/parquet-mr/blob/master/parquet-avro/src/main/java/parquet/avro/AvroParquetWriter.java

You have to give it a Hadoop Path object to write the data to, so this
wouldn't really work in the middle of a NiFi flow if you wanted to
have a processor like ConvertXyzToParquet, because the processor needs
to write the output to an OutputStream which is a location in NiFi's
internal repositories, not HDFS.

It could make sense at the end of a flow when writing to HDFS, so you
could probably implement a custom processor similar to PutHDFS that
used the Parquet libraries to write the data to HDFS as Parquet
(assuming you merged together a bunch of data before this). This is
probably what the Kite processors are already doing, but not sure.

-Bryan

On Tue, Feb 14, 2017 at 5:12 PM, Carlos Paradis <cv...@hawaii.edu> wrote:
> Hi James,
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite I observed, to hear if you had any insight on
> this approach too if you don't mind.
>
> When I saw a presentation of Ni-Fi and Parquet being used in a guest
> project, although not many details implementation wise were discussed, it
> was mentioned using also Apache Spark (apparently only) leaving a port from
> Ni-Fi to read in the data. Someone in Hortonworks posted a tutorial on it
> (github) on Jan 2016 that seems to head towards that direction.
>
> The configuration looked as follows according to the tutorial's image:
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
>
> The group presentation also used Spark, but I am not sure if they used the
> same port approach, this is all I have:
>
>
> PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD <String,String>]
>
> PackageToParquetRunner -> FileProcessorClass -> RDD Filter -> RDDflatMap ->
> RDDMap -> RDD <row> -> PackageToParquetRunner -> Create Data Frame (SQL
> Context) -> Write Parquet (DataFrame).
>
> When you say,
>
>> then running periodic jobs to build Parquet data sets.
>
>
> Would such Spark setup be the case as period jobs? I am minimally acquainted
> on how Spark goes about MapReduce using RDDs, but I am not certain to what
> extent it would support the NiFi pipeline for such purpose (not to mention,
> on the way it appears, seems to leave a hole in NiFi diagram as a port,
> which makes it unable to monitor for data provenance).
>
> ---
>
> Do you think these details and Kite details would be worth mentioning as a
> comment on the JIRA issue you pointed out?
>
> Thanks!
>
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <jv...@gmail.com> wrote:
>>
>> Carlos,
>>
>> Welcome to NiFi!  I believe the Kite dataset is currently the most direct,
>> built-in solution for writing Parquet files from NiFi.
>>
>> I'm not an expert on Parquet, but I understand columnar formats like
>> Parquet and ORC are not easily written to in the incremental, streaming
>> fashion that NiFi excels at (I hope writing this will prompt expert
>> correction).  Other alternatives typically involve NiFi writing to more
>> stream-friendly data stores or formats directly, then running periodic jobs
>> to build Parquet data sets.  Hive, Drill, and similar tools can do this.
>>
>> You are certainly not alone in wanting better Parquet support, there is at
>> least one JIRA ticket for it as well:
>>
>> Add processors for Google Cloud Storage Fetch/Put/Delete
>> https://issues.apache.org/jira/browse/NIFI-2725
>>
>> You might want to chime in with some details of your use case, or create a
>> new ticket if that's not a fit for you.
>>
>> Thanks,
>>
>> James
>>
>> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <cv...@hawaii.edu> wrote:
>>>
>>> Hi,
>>>
>>> Our group has recently started trying to prototype a setup of
>>> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
>>> documentation other than a scarce discussion on using Kite as a workaround
>>> to integrate NiFi and Parquet.
>>>
>>> Are there any future plans for this integration from NiFi or anyone would
>>> be able to give me some insight in which scenario this workaround would
>>> (not) be worthwhile and alternatives?
>>>
>>> The most recent discussion I found in this list dates from May 11, 2016.
>>> I also saw some interest in doing this on Stackoverflow here, and here.
>>>
>>> Thanks,
>>>
>>> --
>>> Carlos Paradis
>>> http://carlosparadis.com
>>
>>
>
>
>
> --
> Carlos Paradis
> http://carlosparadis.com

Re: Integration between Apache NiFi and Parquet or Workaround?

Posted by Carlos Paradis <cv...@hawaii.edu>.
Thank you both, Bryan and Giovanni, for giving me so much insight on this
matter.

I see why you would strongly prefer Kite over this, now that I have landed
on a tutorial
<http://blog.cloudera.com/blog/2014/12/how-to-ingest-data-quickly-using-the-kite-cli/>
on kite-dataset and its documentation page <http://kitesdk.org/docs/1.1.0>
(thanks for pointing the name out).

I also noticed that NIFI-238
<https://issues.apache.org/jira/browse/NIFI-238?focusedCommentId=14350688&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14350688>
(pull request <https://github.com/apache/nifi/pull/24#discussion_r24779170>)
incorporated Kite into NiFi back in 2015, and NIFI-1193
<https://issues.apache.org/jira/browse/NIFI-1193> did the same for Hive in
2016, making three processors available. I am confused, though, because
they no longer appear in the documentation
<https://nifi.apache.org/docs/nifi-docs/>; I only see StoreInKiteDataset
<https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi.processors.kite.StoreInKiteDataset/index.html>,
which appears to be a newer version of what was called
'KiteStorageProcessor' on GitHub, and I don't see the other two.


My original goal was to have one HDFS store dedicated to raw data alone,
and a second HDFS store dedicated to pre-processed data and analysis. If I
were to do this with Kite and NiFi, the way I currently see it being done
is:

---------
*Raw Data HDFS:*

   - Apache NiFi
      - A set of GetFile and GetHTTP processors to acquire the data from
        the multiple sources we have.
      - A PutHDFS processor to store the raw data in HDFS.

*Pre-Processed & Analysis HDFS:*

   - Apache NiFi
      - A set of GetHDFS processors to get data from the *Raw Data HDFS*.
      - A set of ExecuteScript processors to convert XML files to JSON or CSV.
      - A set of ConvertCSVToAvro and ConvertJSONToAvro processors, as the
        Kite processor requires Avro input.
      - StoreInKiteDataset to store all the data in either Avro or Parquet
        format.
   - Apache Spark
      - Batch jobs to pre-process the data into analysis data sets to be
        exported elsewhere (dashboard, machine learning, etc.).

---------

However, a few things I am still confused about: (1) Is this the best way
to go about storing the raw data? (2) Would ExecuteScript allow for
MapReduce-style processing, or would it become a bottleneck? (3) I
originally considered using the Spark Streaming module for mini-batches
integrated with NiFi to at least pre-process the data as it arrives, but I
am a bit unclear on how to go about this now. Would creating a port through
NiFi be the way to go? (That is the only way I have seen it done in
tutorials.)
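
For what it's worth, here is roughly how I am picturing one of the
ExecuteScript steps, as a Jython sketch (my own guess, assuming one flat
XML record per flowfile; the field mapping is made up for illustration and
not taken from any working flow):

import json
import xml.etree.ElementTree as ET
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class XmlToJson(StreamCallback):
    def __init__(self):
        pass
    # rewrite each flowfile's XML content as a flat JSON object
    def process(self, inputStream, outputStream):
        xml_text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        root = ET.fromstring(xml_text)
        record = dict((child.tag, child.text) for child in root)
        outputStream.write(bytearray(json.dumps(record).encode('utf-8')))

flowFile = session.get()
if flowFile is not None:
    flowFile = session.write(flowFile, XmlToJson())
    session.transfer(flowFile, REL_SUCCESS)

If I understand correctly, this runs per flowfile on the NiFi node itself,
so it is not MapReduce, which is part of what I am asking in (2).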

Thank you,




On Wed, Feb 15, 2017 at 7:02 AM, Giovanni Lanzani <
giovannilanzani@godatadriven.com> wrote:

> Hi Carlos,
>
>
>
> I’m just chiming in, but if I wouldn’t use Kite (disclaimer: I would in
> this case) the workflow would look like this:
>
>
>
> - do stuff with NiFi
>
> - convert flowfiles to Avro
>
> - (optional: merge Avro files)
>
> - PutHDFS into a temp folder
>
> - periodically run Spark on that temp folder to convert to Parquet.
>
>
>
> I believe you can work out the first four points by yourself. The last
> point would just be a Python file that looks like this:
>
>
>
> from pyspark.sql import SparkSession
>
>
>
> spark = (SparkSession.builder
>
>                      .appName("Python Spark SQL basic example")
>
>          .config("spark.some.config.option", "some-value")
>
>          .getOrCreate())
>
>
>
> (spark.read.format('com.databricks.spark.avro').load('/tmp/path/dataset.avro')
>
>           .write.format('parquet')
>
>           .mode('append')
>
>           .save('/path/to/outfile'))
>
>
>
> You can then periodically invoke this file with spark-submit filename.py
>
>
>
> For optimal usage, I’d explore the options of having the temporary path
> folder partitioned by hour (or day) and then invoke the above script once
> per temporary folder.
>
>
>
> That said, a few remarks:
>
> - this is a rather complicated flow for something so simple. Kite-dataset
> would work better;
>
> - however if you need more complicated processing, you have all the
> options to do so
>
> - as Parquet is columnar storage, having little files is useless. So when
> you’re merging them, make sure you have enough data (>~ 50MB and to several
> tens of GB’s) in the final file;
>
> - The above code is trivially portable to Scala if you prefer, as I’m
> using Python as a mere DSL on top of Spark (no serializations outside the
> JVM).
>
>
>
> Cheers,
>
>
>
> Giovanni
>
>
>
> *From:* Carlos Paradis [mailto:cvas@hawaii.edu]
> *Sent:* Tuesday, February 14, 2017 11:12 PM
> *To:* users@nifi.apache.org
> *Subject:* Re: Integration between Apache NiFi and Parquet or Workaround?
>
>
>
> Hi James,
>
>
>
> Thank you for pointing the issue out! :-) I wanted to point out another
> alternative solution to Kite I observed, to hear if you had any insight on
> this approach too if you don't mind.
>
>
>
> When I saw a presentation of Ni-Fi and Parquet being used in a guest
> project, although not many details implementation wise were discussed, it
> was mentioned using also Apache Spark (apparently only) leaving a port from
> Ni-Fi to read in the data. Someone in Hortonworks posted a tutorial on it
> <https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
>  (github
> <https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
> on Jan 2016 that seems to head towards that direction.
>
>
>
> The configuration looked as follows according to the tutorial's image:
>
>
>
> https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png
>
>
>
> The group presentation also used Spark, but I am not sure if they used the
> same port approach, this is all I have:
>
>
>
> *PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD
> <String,String>]
>
>
>
> *PackageToParquetRunner *-> FileProcessorClass -> RDD Filter ->
> RDDflatMap -> RDDMap -> RDD <row> -> *PackageToParquetRunner *-> Create
> Data Frame (SQL Context) -> Write Parquet (DataFrame).
>
>
>
> When you say,
>
>
>
> then running periodic jobs to build Parquet data sets.
>
>
>
> Would such Spark setup be the case as period jobs? I am minimally
> acquainted on how Spark goes about MapReduce using RDDs, but I am not
> certain to what extent it would support the NiFi pipeline for such purpose
> (not to mention, on the way it appears, seems to leave a hole in NiFi
> diagram as a port, which makes it unable to monitor for data provenance).
>
>
>
> ---
>
>
>
> Do you think these details and Kite details would be worth mentioning as a
> comment on the JIRA issue you pointed out?
>
>
> Thanks!
>
>
>
>
>
> On Tue, Feb 14, 2017 at 11:46 AM, James Wing <jv...@gmail.com> wrote:
>
> Carlos,
>
> Welcome to NiFi!  I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction).  Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic jobs
> to build Parquet data sets.  Hive, Drill, and similar tools can do this.
>
> You are certainly not alone in wanting better Parquet support, there is at
> least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create a
> new ticket if that's not a fit for you.
>
>
>
> Thanks,
>
> James
>
>
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <cv...@hawaii.edu> wrote:
>
> Hi,
>
>
>
> Our group has recently started trying to prototype a setup of
> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
> documentation other than a scarce discussion on using Kite as a workaround
> to integrate NiFi and Parquet.
>
>
>
> Are there any future plans for this integration from NiFi or anyone would
> be able to give me some insight in which scenario this workaround would
> (not) be worthwhile and alternatives?
>
>
>
> The most recent discussion
> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
> I found in this list dates from May 11, 2016. I also saw some interest in
> doing this on Stackoverflow here
> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
> and here
> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>
> .
>
>
>
> Thanks,
>
>
>
> --
>
> Carlos Paradis
>
> http://carlosparadis.com <http://carlosandrade.co>
>
>
>
>
>
>
>
> --
>
> Carlos Paradis
>
> http://carlosparadis.com <http://carlosandrade.co>
>



-- 
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>

RE: Integration between Apache NiFi and Parquet or Workaround?

Posted by Giovanni Lanzani <gi...@godatadriven.com>.
Hi Carlos,

I’m just chiming in, but if I weren’t using Kite (disclaimer: in this case I would), the workflow would look like this:

- do stuff with NiFi
- convert flowfiles to Avro
- (optional: merge Avro files)
- PutHDFS into a temp folder
- periodically run Spark on that temp folder to convert to Parquet.

I believe you can work out the first four points by yourself. The last point would just be a Python file that looks like this:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("Python Spark SQL basic example")
         .config("spark.some.config.option", "some-value")
         .getOrCreate())

# read the Avro files NiFi dropped into the temp folder and append them,
# converted to Parquet, to the target data set
(spark.read.format('com.databricks.spark.avro').load('/tmp/path/dataset.avro')
          .write.format('parquet')
          .mode('append')
          .save('/path/to/outfile'))

You can then periodically invoke this file with spark-submit filename.py

For optimal usage, I’d explore the option of partitioning the temporary folder by hour (or day) and then invoking the above script once per temporary folder.
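
If you go that route, one option (just a sketch; the spark-avro package
coordinates below are an assumption, use whatever matches your Spark and
Scala versions) is to let the script take the folder and target as
arguments:

import sys
from pyspark.sql import SparkSession

# e.g. in_path = /tmp/path/2017/02/15/10, out_path = /path/to/outfile
in_path, out_path = sys.argv[1], sys.argv[2]

spark = SparkSession.builder.appName("avro-to-parquet").getOrCreate()

# read one hourly Avro folder and append it to the Parquet data set
(spark.read.format('com.databricks.spark.avro').load(in_path)
      .write.format('parquet')
      .mode('append')
      .save(out_path))

and then invoke it once per folder, e.g.

spark-submit --packages com.databricks:spark-avro_2.11:3.2.0 filename.py /tmp/path/2017/02/15/10 /path/to/outfile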

That said, a few remarks:
- this is a rather complicated flow for something so simple. Kite-dataset would work better;
- however if you need more complicated processing, you have all the options to do so
- as Parquet is columnar storage, lots of small files defeat its purpose. So when you’re merging them, make sure you have enough data (from roughly 50 MB up to several tens of GB) in the final file;
- The above code is trivially portable to Scala if you prefer, as I’m using Python as a mere DSL on top of Spark (no serializations outside the JVM).

Cheers,

Giovanni

From: Carlos Paradis [mailto:cvas@hawaii.edu]
Sent: Tuesday, February 14, 2017 11:12 PM
To: users@nifi.apache.org
Subject: Re: Integration between Apache NiFi and Parquet or Workaround?

Hi James,

Thank you for pointing the issue out! :-) I wanted to point out another alternative solution to Kite I observed, to hear if you had any insight on this approach too if you don't mind.

When I saw a presentation of Ni-Fi and Parquet being used in a guest project, although not many details implementation wise were discussed, it was mentioned using also Apache Spark (apparently only) leaving a port from Ni-Fi to read in the data. Someone in Hortonworks posted a tutorial on it<https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html> (github<https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>) on Jan 2016 that seems to head towards that direction.

The configuration looked as follows according to the tutorial's image:

https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png

The group presentation also used Spark, but I am not sure if they used the same port approach, this is all I have:

PackageToParquetRunner <-> getFilePaths() <-> datalake [RDD <String,String>]

PackageToParquetRunner -> FileProcessorClass -> RDD Filter -> RDDflatMap -> RDDMap -> RDD <row> -> PackageToParquetRunner -> Create Data Frame (SQL Context) -> Write Parquet (DataFrame).

When you say,

then running periodic jobs to build Parquet data sets.

Would such Spark setup be the case as period jobs? I am minimally acquainted on how Spark goes about MapReduce using RDDs, but I am not certain to what extent it would support the NiFi pipeline for such purpose (not to mention, on the way it appears, seems to leave a hole in NiFi diagram as a port, which makes it unable to monitor for data provenance).

---

Do you think these details and Kite details would be worth mentioning as a comment on the JIRA issue you pointed out?

Thanks!


On Tue, Feb 14, 2017 at 11:46 AM, James Wing <jv...@gmail.com>> wrote:
Carlos,
Welcome to NiFi!  I believe the Kite dataset is currently the most direct, built-in solution for writing Parquet files from NiFi.

I'm not an expert on Parquet, but I understand columnar formats like Parquet and ORC are not easily written to in the incremental, streaming fashion that NiFi excels at (I hope writing this will prompt expert correction).  Other alternatives typically involve NiFi writing to more stream-friendly data stores or formats directly, then running periodic jobs to build Parquet data sets.  Hive, Drill, and similar tools can do this.

You are certainly not alone in wanting better Parquet support, there is at least one JIRA ticket for it as well:

Add processors for Google Cloud Storage Fetch/Put/Delete
https://issues.apache.org/jira/browse/NIFI-2725
You might want to chime in with some details of your use case, or create a new ticket if that's not a fit for you.

Thanks,
James

On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <cv...@hawaii.edu>> wrote:
Hi,

Our group has recently started trying to prototype a setup of Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any documentation other than a scarce discussion on using Kite as a workaround to integrate NiFi and Parquet.

Are there any future plans for this integration from NiFi or anyone would be able to give me some insight in which scenario this workaround would (not) be worthwhile and alternatives?

The most recent discussion<http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html> I found in this list dates from May 11, 2016. I also saw some interest in doing this on Stackoverflow here<http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>, and here<http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>.

Thanks,

--
Carlos Paradis
http://carlosparadis.com<http://carlosandrade.co>




--
Carlos Paradis
http://carlosparadis.com<http://carlosandrade.co>

Re: Integration between Apache NiFi and Parquet or Workaround?

Posted by Carlos Paradis <cv...@hawaii.edu>.
Hi James,

Thank you for pointing the issue out! :-) I wanted to mention another
alternative to Kite that I came across, to hear whether you have any
insight on this approach as well, if you don't mind.

I saw a presentation of NiFi and Parquet being used in a guest project;
although not many implementation details were discussed, it was mentioned
that they also used Apache Spark, apparently just leaving a port from NiFi
for Spark to read in the data. Someone at Hortonworks posted a tutorial on it
<https://community.hortonworks.com/articles/12708/nifi-feeding-data-to-spark-streaming.html>
(github
<https://github.com/hortonworks-meetup-content/intro-to-apache-spark-streaming-with-apache-nifi-and-apache-kafka>)
in January 2016 that seems to head in that direction.

The configuration looked as follows according to the tutorial's image:

https://community.hortonworks.com/storage/attachments/1669-screen-shot-2016-01-31-at-21029-pm.png


The group presentation also used Spark, but I am not sure if they used the
same port approach; this is all I have:


*PackageToParquetRunner* <-> getFilePaths() <-> datalake [RDD
<String,String>]

*PackageToParquetRunner *-> FileProcessorClass -> RDD Filter -> RDDflatMap
-> RDDMap -> RDD <row> -> *PackageToParquetRunner *-> Create Data Frame
(SQL Context) -> Write Parquet (DataFrame).
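
In PySpark terms I imagine the tail end of that pipeline looks roughly
like this (my own sketch, not the group's actual code, and assuming a
simple "key,value" text input purely for illustration):

from pyspark.sql import Row, SparkSession

spark = SparkSession.builder.appName("PackageToParquetRunner").getOrCreate()
sc = spark.sparkContext

# hypothetical input: one "key,value" pair per line in the data lake
raw = sc.textFile("/datalake/raw")
rows = (raw.filter(lambda line: "," in line)              # RDD Filter
           .flatMap(lambda line: [line.split(",", 1)])    # RDD flatMap
           .map(lambda kv: Row(key=kv[0], value=kv[1])))  # RDD Map -> RDD<Row>

# Create Data Frame -> Write Parquet (DataFrame)
spark.createDataFrame(rows).write.parquet("/datalake/parquet")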

When you say,

then running periodic jobs to build Parquet data sets.


Would such a Spark setup count as those periodic jobs? I am only minimally
acquainted with how Spark goes about MapReduce using RDDs, and I am not
certain to what extent it would support the NiFi pipeline for this purpose
(not to mention that, the way it appears, it leaves a hole in the NiFi
diagram as a port, which makes it impossible to monitor data provenance
there).

---

Do you think these details and Kite details would be worth mentioning as a
comment on the JIRA issue you pointed out?

Thanks!


On Tue, Feb 14, 2017 at 11:46 AM, James Wing <jv...@gmail.com> wrote:

> Carlos,
>
> Welcome to NiFi!  I believe the Kite dataset is currently the most direct,
> built-in solution for writing Parquet files from NiFi.
>
> I'm not an expert on Parquet, but I understand columnar formats like
> Parquet and ORC are not easily written to in the incremental, streaming
> fashion that NiFi excels at (I hope writing this will prompt expert
> correction).  Other alternatives typically involve NiFi writing to more
> stream-friendly data stores or formats directly, then running periodic jobs
> to build Parquet data sets.  Hive, Drill, and similar tools can do this.
>
> You are certainly not alone in wanting better Parquet support, there is at
> least one JIRA ticket for it as well:
>
> Add processors for Google Cloud Storage Fetch/Put/Delete
> https://issues.apache.org/jira/browse/NIFI-2725
>
> You might want to chime in with some details of your use case, or create a
> new ticket if that's not a fit for you.
>
> Thanks,
>
> James
>
> On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <cv...@hawaii.edu> wrote:
>
>> Hi,
>>
>> Our group has recently started trying to prototype a setup of
>> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
>> documentation other than a scarce discussion on using Kite as a workaround
>> to integrate NiFi and Parquet.
>>
>> Are there any future plans for this integration from NiFi or anyone would
>> be able to give me some insight in which scenario this workaround would
>> (not) be worthwhile and alternatives?
>>
>> The most recent discussion
>> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
>> I found in this list dates from May 11, 2016. I also saw some interest in
>> doing this on Stackoverflow here
>> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
>> and here
>> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>
>> .
>>
>> Thanks,
>>
>> --
>> Carlos Paradis
>> http://carlosparadis.com <http://carlosandrade.co>
>>
>
>


-- 
Carlos Paradis
http://carlosparadis.com <http://carlosandrade.co>

Re: Integration between Apache NiFi and Parquet or Workaround?

Posted by James Wing <jv...@gmail.com>.
Carlos,

Welcome to NiFi!  I believe the Kite dataset is currently the most direct,
built-in solution for writing Parquet files from NiFi.

I'm not an expert on Parquet, but I understand columnar formats like
Parquet and ORC are not easily written to in the incremental, streaming
fashion that NiFi excels at (I hope writing this will prompt expert
correction).  Other alternatives typically involve NiFi writing to more
stream-friendly data stores or formats directly, then running periodic jobs
to build Parquet data sets.  Hive, Drill, and similar tools can do this.

You are certainly not alone in wanting better Parquet support; there is at
least one JIRA ticket for it as well:

Add processors for Google Cloud Storage Fetch/Put/Delete
https://issues.apache.org/jira/browse/NIFI-2725

You might want to chime in with some details of your use case, or create a
new ticket if that's not a fit for you.

Thanks,

James

On Mon, Feb 13, 2017 at 3:13 PM, Carlos Paradis <cv...@hawaii.edu> wrote:

> Hi,
>
> Our group has recently started trying to prototype a setup of
> Hadoop+Spark+NiFi+Parquet and I have been having trouble finding any
> documentation other than a scarce discussion on using Kite as a workaround
> to integrate NiFi and Parquet.
>
> Are there any future plans for this integration from NiFi or anyone would
> be able to give me some insight in which scenario this workaround would
> (not) be worthwhile and alternatives?
>
> The most recent discussion
> <http://apache-nifi-developer-list.39713.n7.nabble.com/parquet-format-td10145.html>
> I found in this list dates from May 11, 2016. I also saw some interest in
> doing this on Stackoverflow here
> <http://stackoverflow.com/questions/37149331/apache-nifi-hdfs-parquet-format>,
> and here
> <http://stackoverflow.com/questions/37165764/convert-incoming-message-to-parquet-format>
> .
>
> Thanks,
>
> --
> Carlos Paradis
> http://carlosparadis.com <http://carlosandrade.co>
>