Posted to user@hive.apache.org by Aaron McCurry <am...@gmail.com> on 2015/05/17 18:26:51 UTC

Custom Hive Storage Handler

I'm currently developing a new storage handler, SerDe, and input and output
formats for Apache Blur.  I'm having a problem integrating Hive with Blur's
bulk ingestion process.  The ingestion process is MapReduce based and
requires reading existing input files from Blur and mixing that data with
new inbound data.

The Blur MR job maps multiple inputs (the new data plus the existing data
from Blur); the reducer then sorts the new data and indexes it.  During the
reduce, any existing data that does not need to be re-ingested is ignored.
The new index files are then loaded into Blur.  The reason for the
file-based indexing is large-scale ingestion throughput, which is much
higher than forcing the Blur processes to perform the index updates
themselves.
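For reference, a rough sketch of that multi-input job setup is below.  The
mapper/reducer names and the use of TextInputFormat are placeholders for
illustration, not Blur's actual classes; only the MultipleInputs wiring is
the standard Hadoop API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BlurBulkLoadDriver {

  // Tags new inbound rows so the reducer can tell them apart (body elided).
  static class NewDataMapper extends Mapper<LongWritable, Text, Text, Text> { /* ... */ }

  // Reads existing rows back out of Blur; a Blur-specific InputFormat
  // would be used here instead of TextInputFormat (body elided).
  static class ExistingDataMapper extends Mapper<LongWritable, Text, Text, Text> { /* ... */ }

  // Sorts/merges new and existing rows, drops existing rows that need no
  // re-ingestion, and writes new index files (body elided).
  static class IndexingReducer extends Reducer<Text, Text, Text, Text> { /* ... */ }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "blur-bulk-load");
    job.setJarByClass(BlurBulkLoadDriver.class);

    // One mapper per input source, combined into a single job.
    MultipleInputs.addInputPath(job, new Path(args[0]),   // new inbound data
        TextInputFormat.class, NewDataMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]),   // existing Blur data
        TextInputFormat.class, ExistingDataMapper.class);

    job.setReducerClass(IndexingReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```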

Currently the Hive integration that uses this feature simply dumps the
output from Hive into a tmp location in HDFS.  Then an external process
runs the custom MR job to load the data into Blur.
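Concretely, the two-step flow today is something like the following (the
paths, table names, and jar/class names are made up for illustration):

```shell
# Step 1: Hive writes the query output to a tmp directory in HDFS.
hive -e "INSERT OVERWRITE DIRECTORY '/tmp/blur_staging'
         SELECT * FROM source_table;"

# Step 2: an external process runs the custom MR job, which mixes the
# staged data with the existing Blur data and loads the new index files.
hadoop jar blur-bulk-load.jar BlurBulkLoadDriver \
    /tmp/blur_staging /blur/tables/mytable /tmp/blur_indexes
```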

It feels like I would have to somehow add an extra task in the query plan,
but there doesn't seem to be a clean way to access this part of Hive.  Is
there a cleaner way to integrate the two jobs into a single job from Hive?
Or am I stuck with this two-step process?

Thanks!

Aaron