You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@hive.apache.org by prasenjit mukherjee <pr...@gmail.com> on 2010/03/16 07:17:20 UTC

How to avoid a full table scan for column search. ( HIVE+LUCENE)

Is there a way to avoid full table scan for an arbitrary where-clause
usage ? partitioning/bucketing makes sense only when you know which
columns will be searched upon. I was wondering if there is any project
which combines the SQL-like features of HIVE and inverted-index like
search-features of LUCENE, and works on cloud. Guess I am asking for
too much :(

I have been using oracle till now and my usage is mainly restricted to
do summation-type queries with some where clause, example being  :
"Select SUM(column1)  where col2='foo' AND col3='bar'".
The output is always some aggregation and where clauses can include
"<, >, =, IN".  I would like to use some kind of distributed
processing to speed up the table generation, search query-time. Hive (
and to some extent Pig ) seems to be the closest tool available to
what I am looking for. I am also exploring hbase, but not sure whether
it will be the right choice for my problem.

Hive can definitely help in parallelizing up the search-processing.
But my main concern is whether hive does ( or plans to do ) any
storage optimization like oracle,lucene ( apart from simple
partitioning/bucketing ).  It seems that all the hadoop-options (
hive,pig,hbase) will have to do  an entire table scan.

Appreciate any suggestions/feedback..

-thanks,
Prasen

Re: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by Tim Robertson <ti...@gmail.com>.

I believe it is pretty common practice to build Lucene indexes from HDFS
content (e.g. tab files), or HBase tables.

I have been exploring Lucene indexes to Columns held in HBase, where the
Lucene index indexes fields, but stores only the Key to the HBase row.
More recently I have been using Hive for report running.

I anticipate in future our architecture looking like:
- harvesters (crawlers) / loaders to HBase
- Hive on HBase tables for ad-hoc reporting (possibly using secondary
indexes somehow to reduce to less than full table scan for common reports)
- Lucene for real time search

HBase has a couple of indexing projects.  One uses secondary tables, and
another I see is going for a Lucene index.

We are progressing off Mysql for everything with tables of 200million+
records
(Hive is really performing fantastically for building DBs offline with
200million to 20 million joins - the results still loaded into Mysql tho'
for now)

Cheers,
Tim


On Tue, Mar 16, 2010 at 3:12 PM, Edward Capriolo <ed...@gmail.com>wrote:

>
>
> On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee <prasen.bea@gmail.com
> > wrote:
>
>> forwarding to hdfs and pig mailing-lists for  responses from wider
>> audience.
>>
>>
>> ---------- Forwarded message ----------
>> From: prasenjit mukherjee <pr...@gmail.com>
>> Date: Tue, Mar 16, 2010 at 11:47 AM
>> Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
>> To: hive-user <hi...@hadoop.apache.org>
>>
>>
>> Is there a way to avoid full table scan for an arbitrary where-clause
>> usage ? partitioning/bucketing makes sense only when you know which
>> columns will be searched upon. I was wondering if there is any project
>> which combines the SQL-like features of HIVE and inverted-index like
>> search-features of LUCENE, and works on cloud. Guess I am asking for
>> too much :(
>>
>> I have been using oracle till now and my usage is mainly restricted to
>> do summation-type queries with some where clause, example being  :
>> "Select SUM(column1)  where col2='foo' AND col3='bar'".
>> The output is always some aggregation and where clauses can include
>> "<, >, =, IN".  I would like to use some kind of distributed
>> processing to speed up the table generation, search query-time. Hive (
>> and to some extent Pig ) seems to be the closest tool available to
>> what I am looking for. I am also exploring hbase, but not sure whether
>> it will be the right choice for my problem.
>>
>> Hive can definitely help in parallelizing up the search-processing.
>> But my main concern is whether hive does ( or plans to do ) any
>> storage optimization like oracle,lucene ( apart from simple
>> partitioning/bucketing ).  It seems that all the hadoop-options (
>> hive,pig,hbase) will have to do  an entire table scan.
>>
>> Appreciate any suggestions/feedback..
>>
>> -thanks,
>> Prasen
>>
>
> There is a Jira open with hive to build indexes that will optimize some
> types of queries.
>
> https://cwiki.apache.org/jira/browse/HIVE-417
>

Re: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee
<pr...@gmail.com>wrote:

> forwarding to hdfs and pig mailing-lists for  responses from wider
> audience.
>
>
> ---------- Forwarded message ----------
> From: prasenjit mukherjee <pr...@gmail.com>
> Date: Tue, Mar 16, 2010 at 11:47 AM
> Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
> To: hive-user <hi...@hadoop.apache.org>
>
>
> Is there a way to avoid full table scan for an arbitrary where-clause
> usage ? partitioning/bucketing makes sense only when you know which
> columns will be searched upon. I was wondering if there is any project
> which combines the SQL-like features of HIVE and inverted-index like
> search-features of LUCENE, and works on cloud. Guess I am asking for
> too much :(
>
> I have been using oracle till now and my usage is mainly restricted to
> do summation-type queries with some where clause, example being  :
> "Select SUM(column1)  where col2='foo' AND col3='bar'".
> The output is always some aggregation and where clauses can include
> "<, >, =, IN".  I would like to use some kind of distributed
> processing to speed up the table generation, search query-time. Hive (
> and to some extent Pig ) seems to be the closest tool available to
> what I am looking for. I am also exploring hbase, but not sure whether
> it will be the right choice for my problem.
>
> Hive can definitely help in parallelizing up the search-processing.
> But my main concern is whether hive does ( or plans to do ) any
> storage optimization like oracle,lucene ( apart from simple
> partitioning/bucketing ).  It seems that all the hadoop-options (
> hive,pig,hbase) will have to do  an entire table scan.
>
> Appreciate any suggestions/feedback..
>
> -thanks,
> Prasen
>

There is a Jira open with hive to build indexes that will optimize some
types of queries.

https://cwiki.apache.org/jira/browse/HIVE-417

Re: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee
<pr...@gmail.com>wrote:

> forwarding to hdfs and pig mailing-lists for  responses from wider
> audience.
>
>
> ---------- Forwarded message ----------
> From: prasenjit mukherjee <pr...@gmail.com>
> Date: Tue, Mar 16, 2010 at 11:47 AM
> Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
> To: hive-user <hi...@hadoop.apache.org>
>
>
> Is there a way to avoid full table scan for an arbitrary where-clause
> usage ? partitioning/bucketing makes sense only when you know which
> columns will be searched upon. I was wondering if there is any project
> which combines the SQL-like features of HIVE and inverted-index like
> search-features of LUCENE, and works on cloud. Guess I am asking for
> too much :(
>
> I have been using oracle till now and my usage is mainly restricted to
> do summation-type queries with some where clause, example being  :
> "Select SUM(column1)  where col2='foo' AND col3='bar'".
> The output is always some aggregation and where clauses can include
> "<, >, =, IN".  I would like to use some kind of distributed
> processing to speed up the table generation, search query-time. Hive (
> and to some extent Pig ) seems to be the closest tool available to
> what I am looking for. I am also exploring hbase, but not sure whether
> it will be the right choice for my problem.
>
> Hive can definitely help in parallelizing up the search-processing.
> But my main concern is whether hive does ( or plans to do ) any
> storage optimization like oracle,lucene ( apart from simple
> partitioning/bucketing ).  It seems that all the hadoop-options (
> hive,pig,hbase) will have to do  an entire table scan.
>
> Appreciate any suggestions/feedback..
>
> -thanks,
> Prasen
>

There is a Jira open with hive to build indexes that will optimize some
types of queries.

https://cwiki.apache.org/jira/browse/HIVE-417

Re: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by Edward Capriolo <ed...@gmail.com>.

On Tue, Mar 16, 2010 at 8:36 AM, prasenjit mukherjee
<pr...@gmail.com>wrote:

> forwarding to hdfs and pig mailing-lists for  responses from wider
> audience.
>
>
> ---------- Forwarded message ----------
> From: prasenjit mukherjee <pr...@gmail.com>
> Date: Tue, Mar 16, 2010 at 11:47 AM
> Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
> To: hive-user <hi...@hadoop.apache.org>
>
>
> Is there a way to avoid full table scan for an arbitrary where-clause
> usage ? partitioning/bucketing makes sense only when you know which
> columns will be searched upon. I was wondering if there is any project
> which combines the SQL-like features of HIVE and inverted-index like
> search-features of LUCENE, and works on cloud. Guess I am asking for
> too much :(
>
> I have been using oracle till now and my usage is mainly restricted to
> do summation-type queries with some where clause, example being  :
> "Select SUM(column1)  where col2='foo' AND col3='bar'".
> The output is always some aggregation and where clauses can include
> "<, >, =, IN".  I would like to use some kind of distributed
> processing to speed up the table generation, search query-time. Hive (
> and to some extent Pig ) seems to be the closest tool available to
> what I am looking for. I am also exploring hbase, but not sure whether
> it will be the right choice for my problem.
>
> Hive can definitely help in parallelizing up the search-processing.
> But my main concern is whether hive does ( or plans to do ) any
> storage optimization like oracle,lucene ( apart from simple
> partitioning/bucketing ).  It seems that all the hadoop-options (
> hive,pig,hbase) will have to do  an entire table scan.
>
> Appreciate any suggestions/feedback..
>
> -thanks,
> Prasen
>

There is a Jira open with hive to build indexes that will optimize some
types of queries.

https://cwiki.apache.org/jira/browse/HIVE-417

Fwd: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by prasenjit mukherjee <pr...@gmail.com>.

forwarding to hdfs and pig mailing-lists for  responses from wider audience.

---------- Forwarded message ----------
From: prasenjit mukherjee <pr...@gmail.com>
Date: Tue, Mar 16, 2010 at 11:47 AM
Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
To: hive-user <hi...@hadoop.apache.org>

Is there a way to avoid full table scan for an arbitrary where-clause
usage ? partitioning/bucketing makes sense only when you know which
columns will be searched upon. I was wondering if there is any project
which combines the SQL-like features of HIVE and inverted-index like
search-features of LUCENE, and works on cloud. Guess I am asking for
too much :(

I have been using oracle till now and my usage is mainly restricted to
do summation-type queries with some where clause, example being  :
"Select SUM(column1)  where col2='foo' AND col3='bar'".
The output is always some aggregation and where clauses can include
"<, >, =, IN".  I would like to use some kind of distributed
processing to speed up the table generation, search query-time. Hive (
and to some extent Pig ) seems to be the closest tool available to
what I am looking for. I am also exploring hbase, but not sure whether
it will be the right choice for my problem.

Hive can definitely help in parallelizing up the search-processing.
But my main concern is whether hive does ( or plans to do ) any
storage optimization like oracle,lucene ( apart from simple
partitioning/bucketing ).  It seems that all the hadoop-options (
hive,pig,hbase) will have to do  an entire table scan.

Appreciate any suggestions/feedback..

-thanks,
Prasen

Fwd: How to avoid a full table scan for column search. ( HIVE+LUCENE)

Posted by prasenjit mukherjee <pr...@gmail.com>.

forwarding to hdfs and pig mailing-lists for  responses from wider audience.

---------- Forwarded message ----------
From: prasenjit mukherjee <pr...@gmail.com>
Date: Tue, Mar 16, 2010 at 11:47 AM
Subject: How to avoid a full table scan for column search. ( HIVE+LUCENE)
To: hive-user <hi...@hadoop.apache.org>

Is there a way to avoid full table scan for an arbitrary where-clause
usage ? partitioning/bucketing makes sense only when you know which
columns will be searched upon. I was wondering if there is any project
which combines the SQL-like features of HIVE and inverted-index like
search-features of LUCENE, and works on cloud. Guess I am asking for
too much :(

I have been using oracle till now and my usage is mainly restricted to
do summation-type queries with some where clause, example being  :
"Select SUM(column1)  where col2='foo' AND col3='bar'".
The output is always some aggregation and where clauses can include
"<, >, =, IN".  I would like to use some kind of distributed
processing to speed up the table generation, search query-time. Hive (
and to some extent Pig ) seems to be the closest tool available to
what I am looking for. I am also exploring hbase, but not sure whether
it will be the right choice for my problem.

Hive can definitely help in parallelizing up the search-processing.
But my main concern is whether hive does ( or plans to do ) any
storage optimization like oracle,lucene ( apart from simple
partitioning/bucketing ).  It seems that all the hadoop-options (
hive,pig,hbase) will have to do  an entire table scan.

Appreciate any suggestions/feedback..

-thanks,
Prasen