Posted to dev@accumulo.apache.org by Brian Femiano <bf...@gmail.com> on 2013/05/04 05:30:24 UTC

Accumulo Hive Storage Handler

Use Hive to directly and efficiently query data stored in Accumulo tables.

See the Getting Started Guide and required AUX_JARS list. The homepage also
lists the current limitations.

I've submitted a patch (ACCUMULO-143) to get this directly into Accumulo
trunk, but for now people can experiment with it at:
https://github.com/bfemiano/accumulo-hive-storage-manager.

The CREATE EXTERNAL TABLE keywords allow Hive to create a metastore entry
for the Accumulo table, which in theory suggests you could use
Cloudera Impala directly with Accumulo. I have not tested this, though.

Re: Accumulo Hive Storage Handler

Posted by Brian Femiano <bf...@gmail.com>.

Hey Jason,

I haven't really stressed the HBase and Accumulo storage handler queries
at any respectable scale where the differences would be dramatically
pronounced.

One advantage the AccumuloStorageHandler has over the HBase handler is the
ability to push down predicates involving more than just the rowID.

A WHERE clause of the form "rowid > '5555' AND rowid < '7777' AND name =
'brian'", when delegated to the HBase handler, would filter only on the
rowID and ignore the restriction on the name qualifier. That handler is not
designed to handle predicates involving columns other than the mapped rowID.

The AccumuloPredicateHandler goes the extra mile to support qualifiers. Not
only are the rowID comparisons built into custom Range restrictions, but an
additional filter iterator is added to skip rows whose name qualifier is
not exactly equal to 'brian'.
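
To make the split concrete, here is a minimal sketch of the idea in plain
Python. It is not the Accumulo or Hive API: the Predicate class, the
split_predicates and scan functions, and the in-memory table are all
hypothetical stand-ins, and it assumes rowIDs compare as strings. The rowID
comparisons become a scan range, and the remaining equality test is applied
as a per-row filter inside the scan.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Optional, Tuple

# Hypothetical stand-in for one table row: (rowID, {qualifier: value}).
Row = Tuple[str, Dict[str, str]]


@dataclass(frozen=True)
class Predicate:
    column: str
    op: str      # ">", "<" or "=" (the only operators this sketch handles)
    value: str


def split_predicates(preds: List[Predicate], rowid_column: str = "rowid"):
    """Turn rowID comparisons into range bounds; keep the rest as filters."""
    lower: Optional[str] = None
    upper: Optional[str] = None
    filters: List[Predicate] = []
    for p in preds:
        if p.column == rowid_column and p.op == ">":
            lower = p.value if lower is None else max(lower, p.value)
        elif p.column == rowid_column and p.op == "<":
            upper = p.value if upper is None else min(upper, p.value)
        elif p.column != rowid_column and p.op == "=":
            filters.append(p)
        else:
            raise ValueError(f"unsupported predicate in this sketch: {p}")
    return lower, upper, filters


def scan(table: List[Row], preds: List[Predicate]) -> Iterator[Row]:
    """Scan a rowID-sorted table, applying the range first, then the filters."""
    lower, upper, filters = split_predicates(preds)
    for rowid, columns in table:
        if lower is not None and rowid <= lower:
            continue          # before the start of the range
        if upper is not None and rowid >= upper:
            break             # past the end of the range; stop scanning
        # Plays the role of the filter iterator: drop non-matching rows.
        if all(columns.get(f.column) == f.value for f in filters):
            yield rowid, columns


# Made-up sample data, sorted by rowID.
TABLE: List[Row] = [
    ("1000", {"name": "alice"}),
    ("6000", {"name": "brian"}),
    ("6500", {"name": "carol"}),
    ("7000", {"name": "brian"}),
    ("8000", {"name": "brian"}),
]

WHERE = [
    Predicate("rowid", ">", "5555"),
    Predicate("rowid", "<", "7777"),
    Predicate("name", "=", "brian"),
]

print([rowid for rowid, _ in scan(TABLE, WHERE)])  # ['6000', '7000']
```

A handler that pushed down only the rowID range would hand rows 6000, 6500
and 7000 back to Hive and leave the name check to Hive itself; with the
filter applied during the scan, row 6500 never leaves the store.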

The HBase handler is more evolved in many other storage handler components,
but with respect to Predicate pushdown optimization, I believe the Accumulo
implementation to be a bit stronger. You're right though, I should really
back that up with some metrics.

On Sat, May 4, 2013 at 12:16 PM, Jason Trost <ja...@gmail.com> wrote:

> Hey Brian,
>
> This is pretty cool.  Just out of curiosity do you have any performance
> numbers for this compared to Hive over files or other datastores?  I am
> curious how much the iterators speed things with Predicate pushdowns.
>
> Thanks,
>
> --Jason

Re: Accumulo Hive Storage Handler

Posted by Jason Trost <ja...@gmail.com>.

Hey Brian,

This is pretty cool.  Just out of curiosity do you have any performance
numbers for this compared to Hive over files or other datastores?  I am
curious how much the iterators speed things with Predicate pushdowns.

Thanks,

--Jason
