Posted to user@hive.apache.org by John Vines <jv...@umbc.edu> on 2011/04/14 21:42:03 UTC

Using Hive as a MapReduce backend

Our environment relies heavily on storing data in Hive, but I currently find
myself working on something that falls outside that scope. I have a MapReduce
job written, but it requires a lot of direct user input for information that
could easily be pulled from Hive. However, when I query Hive for extended
table data, all of the extended information is dumped into one or two columns
as a giant blob of almost-JSON. Is there either a convenient way to parse
this information or, better yet, a more direct way to get it?
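
If it helps clarify what I'm after, here is roughly the lookup I'd like to do
programmatically instead of parsing that blob. This is just a sketch assuming
the metastore Thrift client (HiveMetaStoreClient) is the intended route; the
database and table names below are placeholders:

import java.util.List;

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.FieldSchema;
import org.apache.hadoop.hive.metastore.api.Partition;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreLookup {
    public static void main(String[] args) throws Exception {
        // HiveConf picks up hive-site.xml from the classpath, so this should
        // talk to the same metastore the CLI uses.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());

        // Placeholder database/table names.
        Table table = client.getTable("default", "my_table");

        // Columns of the table, not including the partition keys.
        for (FieldSchema col : table.getSd().getCols()) {
            System.out.println(col.getName() + "\t" + col.getType());
        }

        // Locations of all partitions (-1 = no limit).
        List<Partition> parts = client.listPartitions("default", "my_table", (short) -1);
        for (Partition p : parts) {
            System.out.println(p.getSd().getLocation());
        }

        client.close();
    }
}

If that client is in fact the supported way to do this from outside Hive, it
would cover most of what I need.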

Alternatively, if someone could point me to documentation on using
CombineHiveInputFormat manually, that would simplify my code even more. But it
seems like that InputFormat is only used inside Hive itself, with its custom
structs.

Ultimately, what I want to know is the table name, the columns (not including
partition keys), and the partition location for the split a mapper is working
on. If there is yet another way to accomplish this, I'm eager to hear it.
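
To make that goal concrete, here is a rough sketch of the mapper-side piece I
have in mind, assuming a plain FileInputFormat over the partition directories
and that the driver has already stuffed a partition-location-to-table mapping
into the job Configuration (the "hive.lookup.*" keys are made up for the
example):

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class PartitionAwareMapper extends Mapper<LongWritable, Text, Text, Text> {

    private String partitionLocation;
    private String tableName;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The file this mapper's split came from; for a partitioned Hive
        // table, its parent directory is (or sits under) a partition location.
        Path splitPath = ((FileSplit) context.getInputSplit()).getPath();
        partitionLocation = splitPath.getParent().toString();

        // Hypothetical: the driver looked the table up in the metastore and
        // put "hive.lookup.<partition location>" -> table name into the conf.
        tableName = context.getConfiguration().get("hive.lookup." + partitionLocation);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // ... use tableName / partitionLocation along with the record ...
    }
}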

Thanks!
John