Posted to user@flume.apache.org by Peter Eisentraut <pe...@eisentraut.org> on 2013/05/30 22:27:26 UTC

best way to put UDP JSON data into Hadoop

I have a use case for Flume and I'm wondering which of the many options
in Flume to use for making this work.

I have a data source that produces log data in UDP packets containing
JSON (a bit like syslog, but the data is already structured).  I want to
get this into Hadoop somehow (either HBase or HDFS+Hive, not sure yet).

My first attempt was to write a source (based on the syslog UDP source) that
receives UDP packets, parses the JSON, stuffs the fields into the
headers of the internal Flume event object, and sends it off.  (The body
is left empty.)  On the receiving end, I wrote a serializer for the
HBase sink that writes each header field into a separate column.  That
works, but I was confused that the default supplied HBase serializers
ignore all event headers, so I was wondering whether I'm abusing them.
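
To illustrate, the serializer does roughly this (a simplified sketch rather
than my actual code; the class name and row-key scheme are made up):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.conf.ComponentConfiguration;
    import org.apache.flume.sink.hbase.HbaseEventSerializer;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Row;
    import org.apache.hadoop.hbase.util.Bytes;

    // Writes each Flume event header into its own column; the body is ignored
    // (it is empty anyway in my setup).
    public class HeaderToColumnsSerializer implements HbaseEventSerializer {
      private byte[] columnFamily;
      private Map<String, String> headers;

      @Override
      public void configure(Context context) { }

      @Override
      public void configure(ComponentConfiguration conf) { }

      @Override
      public void initialize(Event event, byte[] columnFamily) {
        this.columnFamily = columnFamily;
        this.headers = event.getHeaders();
      }

      @Override
      public List<Row> getActions() {
        // One Put per event, one column per header field.
        byte[] rowKey = Bytes.toBytes(System.currentTimeMillis() + "-" + UUID.randomUUID());
        Put put = new Put(rowKey);
        for (Map.Entry<String, String> h : headers.entrySet()) {
          put.add(columnFamily, Bytes.toBytes(h.getKey()), Bytes.toBytes(h.getValue()));
        }
        List<Row> actions = new ArrayList<Row>();
        actions.add(put);
        return actions;
      }

      @Override
      public List<Increment> getIncrements() {
        return new ArrayList<Increment>();   // no counters
      }

      @Override
      public void close() { }
    }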

An alternative approach I was thinking about was writing a generic UDP
source that stuffs the entire UDP packet into the event body, and then
writing a serializer for the HBase sink that parses the JSON and puts the
fields into the columns.  Or alternatively writing the JSON straight into
HDFS and having Hive do the JSON parsing later.

Which one of these would be more idiomatic and/or generally useful?

RE: best way to put UDP JSON data into Hadoop

Posted by Phil Scala <Ph...@globalrelay.net>.
Paul has some more "real life" examples; I'll offer a few thoughts and ideas, which may or may not be valuable...

I think you will need to create a custom source that accepts the JSON over UDP.  Looking at the current sources out there, I think you can borrow the designs of the HTTP source and the syslog UDP source.  The HTTP source has handlers that support JSON if you wanted to parse the JSON up front, before writing the event to the channel, but that still seems to require your custom HBase sink work anyway.  I'd let the JSON go right on through.
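
To sketch the shape of it (untested; the class name and the "port" property
are just placeholders), a bare-bones version could look like the following.
The real SyslogUDPSource uses Netty, but a plain DatagramSocket shows the idea:

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.util.Arrays;

    import org.apache.flume.Context;
    import org.apache.flume.EventDrivenSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class JsonUdpSource extends AbstractSource
        implements EventDrivenSource, Configurable {

      private int port;
      private DatagramSocket socket;
      private Thread reader;

      @Override
      public void configure(Context context) {
        port = context.getInteger("port", 5140);
      }

      @Override
      public synchronized void start() {
        try {
          socket = new DatagramSocket(port);
        } catch (Exception e) {
          throw new RuntimeException("Could not bind UDP port " + port, e);
        }
        reader = new Thread(new Runnable() {
          @Override
          public void run() {
            byte[] buf = new byte[65535];
            while (!socket.isClosed()) {
              try {
                DatagramPacket packet = new DatagramPacket(buf, buf.length);
                socket.receive(packet);
                byte[] body = Arrays.copyOf(packet.getData(), packet.getLength());
                // The whole JSON payload goes straight through as the event body.
                getChannelProcessor().processEvent(EventBuilder.withBody(body));
              } catch (Exception e) {
                if (socket.isClosed()) {
                  break;   // normal shutdown path
                }
              }
            }
          }
        });
        reader.start();
        super.start();
      }

      @Override
      public synchronized void stop() {
        socket.close();   // unblocks receive() and ends the reader loop
        super.stop();
      }
    }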

Now, on the sink side, you can use the HBaseSink.  If you wanted to re-use the RegexHbaseEventSerializer, the docs describe it as:

      The RegexHbaseEventSerializer (org.apache.flume.sink.hbase.RegexHbaseEventSerializer) breaks
      the event body based on the given regex and writes each part into different columns.

I think you could get away with that and parse the JSON using a regex.  However, I think the regex may end up being too complicated.  So as an alternative, do something similar to the RegexHbaseEventSerializer, but parse the JSON with a native JSON parser and write each JSON property to the row (take a look at the RegexHbaseEventSerializer.getActions() method).
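
Something along these lines, say with Gson since Flume already ships it for
the HTTP source (untested sketch; the class name and row-key scheme are
placeholders):

    import java.nio.charset.Charset;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.UUID;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.conf.ComponentConfiguration;
    import org.apache.flume.sink.hbase.HbaseEventSerializer;
    import org.apache.hadoop.hbase.client.Increment;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Row;
    import org.apache.hadoop.hbase.util.Bytes;

    import com.google.gson.JsonElement;
    import com.google.gson.JsonObject;
    import com.google.gson.JsonParser;

    // Same shape as RegexHbaseEventSerializer, but getActions() parses the
    // JSON body and emits one column per top-level JSON property.
    public class JsonHbaseEventSerializer implements HbaseEventSerializer {
      private byte[] columnFamily;
      private byte[] body;

      @Override
      public void configure(Context context) { }

      @Override
      public void configure(ComponentConfiguration conf) { }

      @Override
      public void initialize(Event event, byte[] columnFamily) {
        this.columnFamily = columnFamily;
        this.body = event.getBody();
      }

      @Override
      public List<Row> getActions() {
        JsonObject json = new JsonParser()
            .parse(new String(body, Charset.forName("UTF-8")))
            .getAsJsonObject();

        byte[] rowKey = Bytes.toBytes(System.currentTimeMillis() + "-" + UUID.randomUUID());
        Put put = new Put(rowKey);
        for (Map.Entry<String, JsonElement> field : json.entrySet()) {
          JsonElement value = field.getValue();
          // Primitives become plain strings; nested objects/arrays stay as JSON text.
          String cell = value.isJsonPrimitive() ? value.getAsString() : value.toString();
          put.add(columnFamily, Bytes.toBytes(field.getKey()), Bytes.toBytes(cell));
        }
        List<Row> actions = new ArrayList<Row>();
        actions.add(put);
        return actions;
      }

      @Override
      public List<Increment> getIncrements() {
        return new ArrayList<Increment>();
      }

      @Override
      public void close() { }
    }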

Hope that helps...


Phil Scala
Global Relay

phil.scala@globalrelay.net

866.484.6630  |  info@globalrelay.net  |  globalrelay.com 


RE: best way to put UDP JSON data into Hadoop

Posted by Paul Chavez <pc...@verticalsearchworks.com>.
I can't speak to the UDP transport mechanism, but we do use JSON events with Hive and it works quite well. 

In our case we have an application that takes an internal object, serializes it to JSON, and puts that JSON into another object we call the 'flume envelope', which has a timestamp and a couple of other headers for routing. We use an HTTPSource to POST the JSON 'envelope' events to Flume, which never does anything special with the JSON 'payload'. On the sink side, after a couple of Avro hops, we serialize to text files with the HDFS sink. Then we use a Hive JSON SerDe to create an external table (Flume is configured to write to partitions based on the timestamp). Every hour an Oozie job processes the previous hour's data into a 'native' Hive table, and then we drop the external partition and its data. The only catch is that the JSON events have to be on a single line.
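
Roughly, the POST our application makes looks like the following, assuming the
stock JSONHandler on the HTTPSource (simplified; the URL, payload and header
values here are just illustrative):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import com.google.gson.JsonArray;
    import com.google.gson.JsonObject;

    public class FlumeEnvelopePost {
      public static void main(String[] args) throws Exception {
        // The application payload stays a single line of JSON, as noted above.
        String payload = "{\"user\":\"example\",\"action\":\"login\"}";

        // The 'flume envelope': routing headers plus the payload as the body.
        JsonObject headers = new JsonObject();
        headers.addProperty("timestamp", String.valueOf(System.currentTimeMillis()));
        headers.addProperty("logType", "webapp");
        headers.addProperty("logSubType", "auth");

        JsonObject event = new JsonObject();
        event.add("headers", headers);
        event.addProperty("body", payload);   // payload rides along as a string

        JsonArray batch = new JsonArray();    // JSONHandler expects an array of events
        batch.add(event);

        HttpURLConnection conn =
            (HttpURLConnection) new URL("http://flume-host:8080").openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/json");
        OutputStream out = conn.getOutputStream();
        out.write(batch.toString().getBytes("UTF-8"));
        out.close();
        System.out.println("HTTPSource responded " + conn.getResponseCode());
      }
    }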

This overall workflow has proven to be extremely useful and flexible. We manage multiple data flows with a single source/channel/sink by writing to paths based on the envelope headers (e.g. /flume/%{logType}/%{logSubType}/date=%Y%m%d/hour=%H).
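
For reference, the sink side of that looks something like this (agent and
component names plus the roll settings are placeholders):

    # HDFS sink writing text files to header-driven, timestamp-partitioned paths
    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.channel = ch1
    agent.sinks.hdfs1.hdfs.path = /flume/%{logType}/%{logSubType}/date=%Y%m%d/hour=%H
    agent.sinks.hdfs1.hdfs.fileType = DataStream
    agent.sinks.hdfs1.hdfs.writeFormat = Text
    agent.sinks.hdfs1.hdfs.rollInterval = 300
    agent.sinks.hdfs1.hdfs.rollSize = 0
    agent.sinks.hdfs1.hdfs.rollCount = 0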

Hope that helps!
Paul Chavez
