Posted to users@nifi.apache.org by Mike Harding <mi...@gmail.com> on 2016/03/02 11:33:12 UTC

Nifi JSON event storage in HDFS

Hi All,

I currently have a small Hadoop cluster running with HDFS and Hive. My
ultimate goal is to leverage NiFi's ingestion and flow capabilities to
store real-time, external, JSON-formatted event data.

What I am unclear about is the best strategy/design for storing FlowFile
data (i.e. JSON events, in my case) in HDFS such that it can then be
accessed and analysed in Hive tables.

Is much of the storage design handled within the NiFi flow, or do I need
to set something up outside of NiFi to ensure I can query each
JSON-formatted event as a record in, for example, a Hive log table?

Any examples or suggestions much appreciated,

Thanks,
M

Re: Nifi JSON event storage in HDFS

Posted by Simon Ball <sb...@hortonworks.com>.
I've been doing a lot of this recently into both Hive and Spark.

One thing that will make life a lot easier is to use the JSON record file format: essentially one JSON document per line of a text file. That means you can use NiFi's MergeContent processor to handle batching into HDFS. Avro also makes a lot of sense, and it can be generated directly out of NiFi.
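To make the "one JSON document per line" idea concrete, here is a minimal Python sketch (event fields are made up for illustration; MergeContent itself is configured in the NiFi UI with a newline demarcator, not in code). The point is that the merged file stays splittable and each event stays independently parseable:

```python
import json

# Hypothetical events, standing in for individual FlowFile contents.
events = [
    {"sensor": "s1", "value": 21.5},
    {"sensor": "s2", "value": 19.0},
]

# MergeContent with a newline demarcator effectively produces one JSON
# document per line ("JSON Lines"), which Hive/Spark/Drill can split by line.
merged = "\n".join(json.dumps(e) for e in events)

# A downstream reader recovers each event independently, line by line.
recovered = [json.loads(line) for line in merged.splitlines()]
assert recovered == events
```

This is why the per-line format matters: a single multi-line pretty-printed JSON document would break line-based record splitting in Hive or Spark.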

Simon

-
Simon Elliston Ball
Product Solutions Architect
+44 7930 424111
Hortonworks - Powering the future of data



Re: Nifi JSON event storage in HDFS

Posted by Christopher Wilson <wi...@gmail.com>.
I used the ConvertJsonToAvro and PutHDFS processors to land files into a
Hive warehouse. Once you get the Avro schema right it's easy. Look at the
avro-tools jar file to help with the schema.
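The "get the schema right" step boils down to making every incoming JSON event match the declared record fields and types. A small illustrative Python check of that idea (field names are hypothetical, and this hand-rolled check only stands in for what ConvertJsonToAvro/avro-tools do against a real Avro schema):

```python
import json

# Illustrative Avro-style record schema; avro-tools can help derive a real one.
schema = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "sensor", "type": "string"},
        {"name": "value", "type": "double"},
    ],
}

# Map Avro primitive type names to the Python types json.loads produces.
AVRO_TO_PY = {"string": str, "double": float, "long": int, "boolean": bool}

def conforms(event: dict, schema: dict) -> bool:
    """Check that a decoded JSON event matches the record schema exactly."""
    fields = {f["name"]: AVRO_TO_PY[f["type"]] for f in schema["fields"]}
    return set(event) == set(fields) and all(
        isinstance(event[name], typ) for name, typ in fields.items()
    )

good = json.loads('{"sensor": "s1", "value": 21.5}')
bad = json.loads('{"sensor": "s1"}')  # missing "value" -> conversion would fail
assert conforms(good, schema) and not conforms(bad, schema)
```

Events that fail a check like this are exactly the ones ConvertJsonToAvro routes to its failure relationship, so it pays to validate sample data against the schema early.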

Chris


Re: Nifi JSON event storage in HDFS

Posted by Conrad Crampton <co...@SecData.com>.
Hi,
I have similar specifications about SQL access – those specifying this keep saying Hive, but I don't believe that is the real requirement (typical developer knowing best, eh?) – I think it is just SQL access that is required. Drill is more flexible (in my opinion – I am not affiliated with Drill in any way) and has drivers for tooling access too (in a similar way to Hive). There is Spark support for Avro too.
I’ll be interested to follow your progress on this.
Conrad

________________________________

The information contained in this message or any of its attachments may be privileged and confidential and intended for the exclusive use of the intended recipient. If you are not the intended recipient any disclosure, reproduction, distribution or other dissemination or use of this communications is strictly prohibited. The views expressed in this email are those of the individual and not necessarily of SecureData Europe Ltd. Any prices quoted are only valid if followed up by a formal written quote.

SecureData Europe Limited. Registered in England & Wales 04365896. Registered Address: SecureData House, Hermitage Court, Hermitage Lane, Maidstone, Kent, ME16 9NT


Re: Nifi JSON event storage in HDFS

Posted by Andrew Grande <ag...@hortonworks.com>.
Sumo,

True, the MapR FS implementation may have compatibility issues. Additionally, things are complicated by the need to bundle some of their proprietary jars, which can't be redistributed with NiFi.

We at Hortonworks have enabled some of our customers to get NiFi and MapR working together before; maybe check with your friendly support engineer for details?

Andrew


Re: Nifi JSON event storage in HDFS

Posted by Sumanth Chinthagunta <xm...@gmail.com>.
I am exploring using the Kite processor to store data into Hadoop. I hope this lets me change the storage engine from HDFS to Hive to HBase later. Since my Hadoop distribution is MapR, I haven't had full success yet.
Sumo

Sent from my iPhone


Re: Nifi JSON event storage in HDFS

Posted by Mike Harding <mi...@gmail.com>.
Hi Conrad,

Thanks for the heads up, I will investigate Apache Drill. I also forgot to
mention that I have downstream requirements around which tools the data
modellers are comfortable using - they want to use Hive and Spark as the
primary data access engines, so the data needs to be persisted in HDFS in
a way that can be easily accessed by these services.

But you're right - there are multiple ways of doing this, and I'm hoping
NiFi will help scope/simplify the pipeline design.

Cheers,
M


Re: Nifi JSON event storage in HDFS

Posted by Conrad Crampton <co...@SecData.com>.
Hi,
I am doing something similar, but having wrestled with Hive data population (not from NiFi) and its performance, I am currently looking at Apache Drill as my SQL abstraction layer over my Hadoop cluster (similar size to yours). To this end, I have chosen Avro as my 'persistence' format, using a number of processors to get from raw data, through mapping attributes, to JSON, to Avro (via schemas), and ultimately storing in HDFS. Querying this with Drill is then a breeze, as the schema is already specified within the data, which Drill understands. The schema can also be extended without impacting existing data.
HTH – I'm sure there are a ton of other ways to skin this particular cat though,
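(The schema-extension point can be sketched in a few lines of Python. This is only an illustration of Avro-style schema resolution with defaults – field names are invented, and a real Avro reader library does this for you:)

```python
# A record written under the old schema, before "region" existed.
old_record = {"host": "web1", "status": 200}

# The extended schema: the new field carries a default, which is what
# lets readers resolve old data without rewriting it.
new_schema_fields = [
    {"name": "host", "type": "string"},
    {"name": "status", "type": "long"},
    {"name": "region", "type": "string", "default": "unknown"},  # added later
]

def read_with_schema(record: dict, fields: list) -> dict:
    """Fill missing fields from schema defaults, as an Avro reader would."""
    return {f["name"]: record.get(f["name"], f.get("default")) for f in fields}

assert read_with_schema(old_record, new_schema_fields) == {
    "host": "web1", "status": 200, "region": "unknown",
}
```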
Conrad
