Posted to user@hbase.apache.org by Bin Wang <bi...@gmail.com> on 2016/05/11 21:21:53 UTC

Able to search by all the columns and faster than Impala

Hi there,

I have a use case here: a table with a few billion rows and fewer than 50
columns. This is a very popular data source, and there is huge demand
internally from people who want to query the table.
Nothing is more complex than "select * from ... where ... and ..."

However, not everyone is able to write SQL, so we are thinking about
building this into a web application. The tools we usually use are Hive
and Impala, and neither of them is fast enough to be the backend of a web
application. (Think about what happens if you click a button and the
result comes back minutes later... annoying!)

We tried to dump the data into SolrCloud, and it worked really well when
the number of returned results is small... but the performance gets really
bad once the row count reaches even a few thousand...

We also tried to use HBase directly, but it is really hard to design a
row-key schema that serves queries on arbitrary columns, and a full table
scan takes lots of time anyway.

Then I had an idea: we can write a MapReduce job to turn the data into a
melted <http://www.statmethods.net/management/reshape.html> format. We then
group the records by column_name and cell_value and build, for each pair,
the list of row ids where that value appears, sort of like Lucene's
inverted index. So we would have two tables in HBase: the original data
keyed by row id, and this inverted index.

So when a user issues a query with, say, three WHERE conditions, it will
first go quickly to the inverted index to find the row ids that meet each
condition, then compute the intersection of those three lists. That gives
us the list of all row ids that meet all three criteria. In the end, we
look those row ids up in the other HBase table to retrieve the full rows.
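
A rough sketch of that read path, reusing the made-up "data_index" layout
above plus a main table called "data" keyed by row id:

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class QueryFlow {
  // Fetch the posting list (set of row ids) for one "column=value" condition.
  static Set<String> rowIdsFor(Table index, String condition) throws Exception {
    Result r = index.get(new Get(Bytes.toBytes(condition)));
    Set<String> ids = new HashSet<>();
    if (!r.isEmpty()) {
      for (byte[] qualifier : r.getFamilyMap(Bytes.toBytes("ids")).keySet()) {
        ids.add(Bytes.toString(qualifier));
      }
    }
    return ids;
  }

  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
         Table index = conn.getTable(TableName.valueOf("data_index"));
         Table data = conn.getTable(TableName.valueOf("data"))) {
      // Intersect the posting lists of the three conditions.
      Set<String> hits = rowIdsFor(index, "city=denver");
      hits.retainAll(rowIdsFor(index, "status=active"));
      hits.retainAll(rowIdsFor(index, "year=2016"));
      // Batch-fetch the surviving rows from the main table.
      List<Get> gets = new ArrayList<>();
      for (String id : hits) {
        gets.add(new Get(Bytes.toBytes(id)));
      }
      Result[] rows = data.get(gets);
      System.out.println(rows.length + " matching rows");
    }
  }
}

One thing we are still unsure about: for a low-cardinality column a single
posting list can hold millions of row ids, so reading one index row could
get as expensive as the scan it replaces.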

Can anyone shed some light on this?

Best regards,

Bin

RE: Able to search by all the columns and faster than Impala

Posted by Dave Birdsall <da...@esgyn.com>.
Hi,

If your SQL-on-Hadoop solution supports secondary indexes, you can simply
create those on the popular columns to speed up query time.
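
For example, Apache Phoenix is one SQL-on-HBase option that supports
secondary indexes; a hypothetical one-statement setup over JDBC (the table
and column names below are invented) might look like:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateIndex {
  public static void main(String[] args) throws Exception {
    // Phoenix connects through the HBase cluster's ZooKeeper quorum;
    // the Phoenix JDBC driver must be on the classpath.
    try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host:2181");
         Statement stmt = conn.createStatement()) {
      // Secondary index on a popular filter column; INCLUDE makes the
      // index cover "status" so such queries skip the base table.
      stmt.execute("CREATE INDEX idx_city ON big_table (city) INCLUDE (status)");
    }
  }
}

Phoenix then maintains the index automatically as rows are written through
it.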

Dave
