Posted to common-user@hadoop.apache.org by Liam Friel <li...@gmail.com> on 2011/07/14 19:10:02 UTC

Advice on processing large XML files for Hive ingest

Hi,

I have data files each consisting of scads of very long records, each record
being an XML doc in its own right.
These XML docs have a complex structure: something like this

<record>
  <sec1>
      <foo>
         <bar id="asd"><stuff><MORE STUFF></bar>
         <bar ... >
      </foo>
  </sec1>
  <sec2>
  </sec2>
  ...
  <secN>
  </secN>
</record>

(except without the line breaks)
These are generated by another system, aggregated by flume and dumped into
HDFS.

Anyhoo ... I'd like to load up this entire thing into Hive tables.
Logically, the <sec> sections fit reasonably well into individual tables and
this matches with the sorts of reports and data mining we want to do over
the data.

To start with, writing Java code is not really an option. While I speak
several programming languages I am not fluent in Java or proficient in Java
development, so I plan to do any map/reduce steps necessary using Streaming
and Python.

I've looked into this, and done some proof of concept work using
Streaming/Python. I am fairly new to HDFS/Hadoop.

One approach which would definitely work would be to run a streaming
mapper-only job per table to be produced. Each streaming job produces a
directory of part-xxx files, and we import these files into Hive.
However, the disadvantage as I see it is that we have to process each data
file multiple times, each pass spitting out one "type" of Hive table.
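To make the per-table approach concrete, here is a minimal sketch of what one such streaming mapper might look like in Python. The section name "sec1" and the idea of emitting the <bar> id attribute as a column are just illustrative assumptions; the real field extraction would depend on the actual record schema.

```python
#!/usr/bin/env python
# Hypothetical streaming mapper for ONE table ("sec1" here).
# Reads one XML record per input line, emits tab-separated rows.
import sys
import xml.etree.ElementTree as ET

def rows_for_sec1(record_xml):
    """Yield one tab-separated row per <bar> under <sec1>.
    Field choice (the "id" attribute) is illustrative only."""
    try:
        record = ET.fromstring(record_xml)
    except ET.ParseError:
        return  # skip malformed records rather than failing the job
    sec = record.find('sec1')
    if sec is None:
        return
    for bar in sec.iter('bar'):
        yield bar.get('id', '')

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        if line:
            for row in rows_for_sec1(line):
                print(row)
```

Run one such mapper-only job per table, each with its own output directory, and point a Hive external table (or LOAD DATA) at each directory.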

This seems a bit inefficient: what I think I really want is a single
map/reduce job that sends its output to different directories in HDFS.
Perhaps by having a mapper send its output to a different file depending on
the key it was dealing with?

Is this possible?
Or do I just have to launch a map/reduce job (well, map only actually) for
each output directory I want, and have each directory contain a single type
of Hive table input?
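One workaround for the single-job idea, sketched below, is to tag each output line with the table name as the key, so everything is extracted in one pass and a downstream step (a reducer, a split script, or a partitioned Hive table keyed on the tag) can route rows to per-table locations. The section names in TABLES and the "id" field are illustrative assumptions, not the real schema.

```python
#!/usr/bin/env python
# Hypothetical single-pass mapper: one read of the data, all tables tagged.
import sys
import xml.etree.ElementTree as ET

TABLES = ('sec1', 'sec2')  # made-up section/table names for illustration

def tagged_rows(record_xml):
    """Yield (table_name, row) pairs for every known section in a record."""
    try:
        record = ET.fromstring(record_xml)
    except ET.ParseError:
        return  # skip malformed records
    for table in TABLES:
        sec = record.find(table)
        if sec is None:
            continue
        for bar in sec.iter('bar'):
            yield table, bar.get('id', '')

if __name__ == '__main__':
    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        for table, row in tagged_rows(line):
            # table name as the key; a later step splits by this tag
            print('%s\t%s' % (table, row))
```

The trade-off is that you still need that later split step, since a plain streaming job writes all its part files into one output directory.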

Thanks for any hints.

Regards
Liam