Posted to user@drill.apache.org by Carol McDonald <cm...@maprtech.com> on 2015/01/20 16:56:07 UTC

Re: String for HBase row key

what if the HBase primary key is a composite key composed of multiple
types, for example a string followed by a reverse timestamp (long) like
AMZN_9223370655563575807?

are there parameters to specify the length in the function
convert_from(string bytea, src_encoding name)?
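A reverse timestamp of this kind is typically computed as Long.MAX_VALUE minus
the epoch-millis timestamp, so that newer rows sort first under HBase's
lexicographic key ordering. A minimal Python sketch of how such a key could be
built (the AMZN prefix and the timestamp value are illustrative assumptions,
not taken from any real table):

```python
# Sketch: building a composite row key of the form <entity>_<reverse timestamp>.
# The reverse timestamp is Long.MAX_VALUE - epoch_millis, so newer rows sort
# first lexicographically. The prefix and timestamp here are illustrative only.
LONG_MAX = 2**63 - 1  # Java Long.MAX_VALUE

def make_row_key(entity: str, epoch_millis: int) -> str:
    reverse_ts = LONG_MAX - epoch_millis
    return f"{entity}_{reverse_ts}"

key = make_row_key("AMZN", 1381291200000)
print(key)  # AMZN_9223370655563575807
```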



On Thu, Dec 18, 2014 at 12:22 AM, Jacques Nadeau <ja...@apache.org> wrote:

> String keys work but aren't the most performant or appropriate encoding to
> use in many cases.  Drill provides CONVERT_TO and CONVERT_FROM with a large
> number of encodings (including those used by many Hadoop applications as
> well as the Apache Phoenix project).  This improves the performance of data
> use in HBase.  You can use strings, but you should use an encoding
> appropriate to your actual data.  Drill will then do projection pushdown,
> filter pushdown and range pruning based on your query.
>
> On Wed, Dec 17, 2014 at 8:33 AM, Carol Bourgade <cj...@gmail.com>
> wrote:
> >
> > The Impala documentation says that for best performance you should use the
> > string data type for HBase row keys.  I know that you do not have to
> > define the data types for Drill queries, but do string bytes work better
> > for Drill queries on HBase row keys?
> >
> >
> >
> http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_hbase.html
> > For best performance of Impala queries against HBase tables, most queries
> > will perform comparisons in the WHERE against the column that corresponds
> > to the HBase row key. When creating the table through the Hive shell, use
> > the STRING data type for the column that corresponds to the HBase row
> key.
> > Impala can translate conditional tests (through operators such as =, <,
> > BETWEEN, and IN) against this column into fast lookups in HBase, but this
> > optimization ("predicate pushdown") only works when that column is
> defined
> > as STRING.
> >
>
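The range-pruning point above can be illustrated outside Drill: for a
big-endian encoding of unsigned integers, byte-wise (lexicographic) order
matches numeric order, which is what lets HBase range scans line up with
numeric predicates. A small Python sketch of that encoding property (an
illustration, not Drill's implementation):

```python
# Sketch: why a big-endian binary encoding such as Drill's 'int_be' family is
# range-scan friendly. For unsigned integers, sorting the encoded byte strings
# lexicographically gives the same order as sorting the numbers themselves.
import struct

values = [3, 100, 7, 255, 42]
encoded = [struct.pack(">I", v) for v in values]  # 4-byte big-endian unsigned

# Sorting the raw bytes recovers the numeric order.
by_bytes = [struct.unpack(">I", b)[0] for b in sorted(encoded)]
print(by_bytes)  # [3, 7, 42, 100, 255]
```

Note that for signed integers the sign bit breaks this property unless it is
flipped during encoding, which is one reason dedicated key encodings exist.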

Re: String for HBase row key

Posted by Hao Zhu <hz...@maprtech.com>.
It seems the byte_substr function is not documented anywhere.
Should we add the example below to the wiki?

In the hbase shell, create an HBase or MapR-DB table:

create '/testtable','cf'
put '/testtable','ABCD_9223370655563575807','cf:c','abc'

Then we can get the substring of rowkey in Drill:

select convert_from(byte_substr(row_key,1,4),'UTF8') from dfs.`/testtable`;
+------------+
|   EXPR$0   |
+------------+
| ABCD       |
+------------+
1 row selected (0.293 seconds)
select convert_from(byte_substr(row_key,6,length(row_key)),'UTF8') from dfs.`/testtable`;
+------------+
|   EXPR$0   |
+------------+
| 9223370655563575807 |
+------------+
1 row selected (0.263 seconds)
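The two queries above can be emulated with plain byte slicing; byte_substr
appears to take a 1-based start offset and a length, so the calls map onto
Python slices as follows (a sketch of the offset arithmetic, not Drill's
implementation):

```python
# Sketch: emulating byte_substr(row_key, start, length) with Python slices,
# assuming a 1-based start offset: byte_substr(k, 1, 4) -> k[0:4], and
# byte_substr(k, 6, length(k)) -> k[5:] (the length is clamped to what remains).
row_key = b"ABCD_9223370655563575807"

prefix = row_key[0:4].decode("utf-8")  # byte_substr(row_key, 1, 4), then UTF8
suffix = row_key[5:].decode("utf-8")   # byte_substr(row_key, 6, length(row_key))
print(prefix, suffix)  # ABCD 9223370655563575807
```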

Thanks,
Hao

On Tue, Jan 20, 2015 at 1:44 PM, Aditya <ad...@gmail.com> wrote:

> I have been experimenting with byte_substr() function to build composite
> keys as part of the HBase row key with good results.
>
> Here is a sample view for TPCH lineitem table
>
> select
>     convert_from(byte_substr(lineitem.row_key, 9, 4), 'int_be') as
> l_orderkey,
>     convert_from(byte_substr(lineitem.row_key, 14, 4), 'int_be') as
> l_partkey,
>     convert_from(byte_substr(lineitem.row_key, 18, 4), 'int_be') as
> l_suppkey,
>     convert_from(byte_substr(lineitem.row_key, 13, 1), 'tinyint') as
> l_linenumber,
>     convert_from(lineitem.F.l_quantity, 'double_be') as l_quantity,
>     convert_from(lineitem.F.l_extendedprice, 'double_be') as
> l_extendedprice,
>     convert_from(lineitem.F.l_discount, 'double_be') as l_discount,
>     convert_from(lineitem.F.l_tax, 'double_be') as l_tax,
>     convert_from(lineitem.F.l_returnflag, 'utf8') as l_returnflag,
>     convert_from(lineitem.F.l_linestatus, 'utf8') as l_linestatus,
>     convert_from(byte_substr(lineitem.row_key, 1, 8), 'date_epoch_be') as
> l_shipdate,
>     convert_from(lineitem.F.l_commitdate, 'date_epoch_be') l_commitdate,
>     convert_from(lineitem.F.l_receiptdate, 'date_epoch_be') l_receiptdate,
>     convert_from(lineitem.F.l_shipinstruct, 'utf8') as l_shipinstruct,
>     convert_from(lineitem.F.l_shipmode, 'utf8') as l_shipmode,
>     convert_from(lineitem.G.l_comment, 'utf8') as l_comment
> from hbase.lineitem lineitem;
>
> To answer, Ted's question on filter pushdown, we have identified this as a
> requirement
> i.e. to be able to pushdown predicates with byte_substr() into HBase scan
> but it is not
> yet implemented.

Re: String for HBase row key

Posted by Aditya <ad...@gmail.com>.
I have been experimenting with the byte_substr() function to build composite
keys as part of the HBase row key, with good results.

Here is a sample view for the TPCH lineitem table:

select
    convert_from(byte_substr(lineitem.row_key, 9, 4), 'int_be') as l_orderkey,
    convert_from(byte_substr(lineitem.row_key, 14, 4), 'int_be') as l_partkey,
    convert_from(byte_substr(lineitem.row_key, 18, 4), 'int_be') as l_suppkey,
    convert_from(byte_substr(lineitem.row_key, 13, 1), 'tinyint') as l_linenumber,
    convert_from(lineitem.F.l_quantity, 'double_be') as l_quantity,
    convert_from(lineitem.F.l_extendedprice, 'double_be') as l_extendedprice,
    convert_from(lineitem.F.l_discount, 'double_be') as l_discount,
    convert_from(lineitem.F.l_tax, 'double_be') as l_tax,
    convert_from(lineitem.F.l_returnflag, 'utf8') as l_returnflag,
    convert_from(lineitem.F.l_linestatus, 'utf8') as l_linestatus,
    convert_from(byte_substr(lineitem.row_key, 1, 8), 'date_epoch_be') as l_shipdate,
    convert_from(lineitem.F.l_commitdate, 'date_epoch_be') l_commitdate,
    convert_from(lineitem.F.l_receiptdate, 'date_epoch_be') l_receiptdate,
    convert_from(lineitem.F.l_shipinstruct, 'utf8') as l_shipinstruct,
    convert_from(lineitem.F.l_shipmode, 'utf8') as l_shipmode,
    convert_from(lineitem.G.l_comment, 'utf8') as l_comment
from hbase.lineitem lineitem;

To answer Ted's question on filter pushdown: we have identified this as a
requirement, i.e. being able to push down predicates with byte_substr() into
the HBase scan, but it is not yet implemented.
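The byte offsets in the view above imply a fixed-width key layout: bytes 1-8
hold l_shipdate, 9-12 l_orderkey, 13 l_linenumber, 14-17 l_partkey, and 18-21
l_suppkey (1-based). A Python sketch of packing and unpacking such a 21-byte
key with big-endian struct formats (the field widths are read off the view;
the sample values and the epoch-millis representation of the date are
assumptions):

```python
# Sketch: a 21-byte composite row key matching the byte_substr offsets above.
# >q = 8-byte big-endian long (shipdate as epoch millis, an assumption),
# >i = 4-byte big-endian int (orderkey, partkey, suppkey), B = 1 byte (linenumber).
import struct

KEY_FMT = ">qiBii"  # 8 + 4 + 1 + 4 + 4 = 21 bytes, no padding

def pack_key(shipdate_ms, orderkey, linenumber, partkey, suppkey):
    return struct.pack(KEY_FMT, shipdate_ms, orderkey, linenumber, partkey, suppkey)

def unpack_key(key):
    return struct.unpack(KEY_FMT, key)

key = pack_key(820454400000, 1, 3, 155190, 7706)
print(len(key))         # 21
print(unpack_key(key))  # (820454400000, 1, 3, 155190, 7706)
# key[8:12] is the slice that byte_substr(row_key, 9, 4) would extract as l_orderkey.
```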


On Tue, Jan 20, 2015 at 1:27 PM, Ted Dunning <te...@gmail.com> wrote:

> What about filter pushdown in these cases?  I know that some filter ops
> push down through convert calls.  What about through byte_substr?
>

Re: String for HBase row key

Posted by Ted Dunning <te...@gmail.com>.
What about filter pushdown in these cases?  I know that some filter ops
push down through convert calls.  What about through byte_substr?



On Tue, Jan 20, 2015 at 12:39 PM, Jacques Nadeau <ja...@apache.org> wrote:

> I believe there is byte_substr (or similar) which you could use before
> handing the value to convert_from

Re: String for HBase row key

Posted by Jacques Nadeau <ja...@apache.org>.
I believe there is byte_substr (or similar) which you could use before
handing the value to convert_from

On Tue, Jan 20, 2015 at 7:56 AM, Carol McDonald <cm...@maprtech.com>
wrote:

> what if the HBase primary key is a composite key  composed of multiple
> types , for example  a string followed by a reverse timestamp (long)  like
> AMZN_9223370655563575807,
>
> are there parameters to specify the length in the function
> convert_from(string bytea, src_encoding name)?