Posted to hdfs-user@hadoop.apache.org by Adrian Hains <ah...@gmail.com> on 2013/09/20 18:40:41 UTC

Transforming xml payload from flume events (avro) to custom avro structure

Summary:
I want to harvest a subset of custom data from an avro container structure,
and store it off as avro data files. I'm having some difficulty in
determining the cleanest place to implement my logic.

Details:
I have a flume flow that is shipping a somewhat complex set of xml
structures to hdfs, by way of a custom avro wrapper. I wish to pull certain
values out of the xml structures into hive tables for reporting. To get to
my xml data within the hdfs files of flume events, I logically need to (see
the code sketch after this list):
(1) unmarshal the flume event to get the byte array of the _body_ field
(2) unmarshal these bytes from the _body_ field to my custom avro wrapper
structure
(3) navigate through my wrapper structure to locate the specific xml
payload I need to harvest data from. The xml itself is binary serialized as
Fast Infoset.
(4) unmarshal the Fast Infoset-encoded xml to POJOs (or xml) and pull
certain values out to store in a hive table. It is semi-structured data
that is primarily atomic values and secondarily some lists and other
structures.
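
Roughly, in Java, steps 1 through 4 come out something like the sketch
below. The field names ("body", "xmlPayload") and the wrapper schema
location are placeholders for my actual structures, and the xml step
uses the Fast Infoset StAX parser:

import com.sun.xml.fastinfoset.stax.StAXDocumentParser;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.mapred.FsInput;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

import javax.xml.stream.XMLStreamReader;
import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;

public class WrapperExtractor {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // (1) Open an hdfs file of avro-serialized flume events; the avro
    // container file carries the flume event (writer) schema itself.
    DataFileReader<GenericRecord> events = new DataFileReader<>(
        new FsInput(new Path(args[0]), conf),
        new GenericDatumReader<GenericRecord>());

    // Reader for the custom wrapper structure, built from its .avsc
    // (path is a placeholder).
    Schema wrapperSchema = new Schema.Parser().parse(
        WrapperExtractor.class.getResourceAsStream("/my-wrapper.avsc"));
    GenericDatumReader<GenericRecord> wrapperReader =
        new GenericDatumReader<>(wrapperSchema);

    BinaryDecoder decoder = null;
    for (GenericRecord event : events) {
      // (2) The event body is a bytes field holding the wrapper.
      ByteBuffer body = (ByteBuffer) event.get("body");
      decoder = DecoderFactory.get().binaryDecoder(
          body.array(), body.arrayOffset() + body.position(),
          body.remaining(), decoder);
      GenericRecord wrapper = wrapperReader.read(null, decoder);

      // (3) Navigate the wrapper to the Fast Infoset xml payload
      // ("xmlPayload" stands in for the real path through the wrapper).
      ByteBuffer fi = (ByteBuffer) wrapper.get("xmlPayload");

      // (4) Walk the Fast Infoset bytes with the FI StAX parser and
      // harvest the values of interest along the way.
      XMLStreamReader xml = new StAXDocumentParser(
          new ByteArrayInputStream(fi.array(),
              fi.arrayOffset() + fi.position(), fi.remaining()));
      while (xml.hasNext()) {
        xml.next(); // pull out the atomic values, lists, etc. here
      }
    }
    events.close();
  }
}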

I intend to store the set of values from #4 as an avro data file, and
leverage it as a hive table by using
org.apache.hadoop.hive.serde2.avro.AvroSerDe. I could alternatively store
this data as flat files, but I'm not sure that would gain me anything.
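
For concreteness, producing that reduced dataset as an avro data file
would look roughly like this (report_record and its fields are just
stand-ins for my real reduced schema):

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

import java.io.File;

public class ReducedDatasetWriter {
  public static void main(String[] args) throws Exception {
    // Placeholder schema for the reduced dataset; AvroSerDe would be
    // pointed at this same schema for the hive table.
    Schema schema = new Schema.Parser().parse(
        "{\"type\":\"record\",\"name\":\"report_record\",\"fields\":["
        + "{\"name\":\"device_id\",\"type\":\"string\"},"
        + "{\"name\":\"reading\",\"type\":\"double\"}]}");

    try (DataFileWriter<GenericRecord> writer = new DataFileWriter<>(
             new GenericDatumWriter<GenericRecord>(schema))) {
      writer.create(schema, new File("reduced-dataset.avro"));

      GenericRecord rec = new GenericData.Record(schema);
      rec.put("device_id", "sensor-42");
      rec.put("reading", 3.14);
      writer.append(rec);
    }
  }
}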

I was originally thinking I would use a custom SerDe with hive that would
allow me to read in the flume event avro structures, harvest the desired
data, and write out the custom avro structure that represents my reduced
dataset. Then at reporting time I would use a typical AvroSerDe with a
schema that describes this reduced dataset.
After experimenting in hive with SERDE=AvroSerDe,
INPUTFORMAT=AvroContainerInputFormat, and
OUTPUTFORMAT=AvroContainerOutputFormat, I see that they are all tied
together to one avro schema (used for reading, writing, and providing the
table metadata). This makes it difficult to use the flume event avro schema
on reading and my custom avro schema (for the transformed structure) on
writing. I think I can still work with this by implementing my own
inputformat that is specialized for flume events (and ignores the avro
schema defined in the table properties), but I'm wondering if I am
considering the wrong tool for the job and would be better off with a
custom map job (or something else).
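
For comparison, the custom map job alternative would be roughly a
map-only MapReduce job along these lines. The extractReduced method is a
stub standing in for the steps 1-4 harvesting logic sketched earlier, and
reduced.avsc for the reduced schema:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.mapred.AvroKey;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyInputFormat;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

public class FlumeEventTransformJob {

  public static class TransformMapper
      extends Mapper<AvroKey<GenericRecord>, NullWritable,
                     AvroKey<GenericRecord>, NullWritable> {
    @Override
    protected void map(AvroKey<GenericRecord> key, NullWritable value,
        Context context) throws IOException, InterruptedException {
      GenericRecord reduced = extractReduced(key.datum());
      context.write(new AvroKey<>(reduced), NullWritable.get());
    }

    // Stub for the steps 1-4 harvesting logic: decode the wrapper from
    // the event body, parse the Fast Infoset xml, build the reduced record.
    private static GenericRecord extractReduced(GenericRecord flumeEvent) {
      throw new UnsupportedOperationException("fill in harvesting logic");
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(FlumeEventTransformJob.class);
    job.setMapperClass(TransformMapper.class);
    job.setNumReduceTasks(0); // map-only

    // Read the flume-event avro files; without an explicit reader
    // schema the writer (flume event) schema from each file is used.
    job.setInputFormatClass(AvroKeyInputFormat.class);

    // Write the reduced records as avro container files.
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    Schema reducedSchema = new Schema.Parser().parse(
        FlumeEventTransformJob.class.getResourceAsStream("/reduced.avsc"));
    AvroJob.setOutputKeySchema(job, reducedSchema);
    job.setOutputValueClass(NullWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}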

I'm not sure if anyone has experience with requirements along these lines,
but if so I would love to hear what you learned!

Cheers,
Adrian Hains