Posted to user@hbase.apache.org by Shengjie Min <ke...@gmail.com> on 2012/09/10 16:24:58 UTC

Hbase filter-SubstringComparator vs full text search indexing

In my case, I have all the log events stored in HDFS/hbase in this format:

timestamp | priority | category | message body

With only four fields, my queries are limited to those four. I am thinking
about more advanced search, such as full text search on the message body;
mainly substring queries against the message body.

   1. Has anybody tried the HBase SubstringComparator? How does it perform?
      With a reasonably large amount of data, can it still give us
      real-time responses?

   2. In my case, does it make more sense to use a proper full text search
      engine (Lucene/Solr/Elasticsearch) to index the message body?

It would be great if someone experienced could share some stories here.

-Shengjie Min

Re: Hbase filter-SubstringComparator vs full text search indexing

Posted by Jacques <wh...@gmail.com>.
Two cents below...

On Mon, Sep 10, 2012 at 7:24 AM, Shengjie Min <ke...@gmail.com> wrote:

> In my case, I have all the log events stored in HDFS/hbase in this format:
>
> timestamp | priority | category | message body
>
> Given I have only 4 fields here, that limits my queries to only against
> these four. I am thinking about more advanced search like full text search
> the message body. well, mainly substring query against message body.
>
>    1. Has anybody tried to use Hbase SubstringComparator? How does it perform,
>    with reasonable huge amount of data, can it still provide us the real
>    time response capability?
>

Probably not, if "huge" is sufficiently large.  HBase indexes data only by
the row key, so a search on any other criterion (such as a substring of the
message body) requires a full scan of the data.
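
To illustrate the point about full scans, here is a minimal Python sketch.
It is a toy in-memory model, not the HBase client API: the sample rows, the
timestamp-style row keys, and the function names range_scan and
substring_scan are all made up for illustration.  In the HBase Java client
the substring case would roughly correspond to a Scan with a
SingleColumnValueFilter wrapping a SubstringComparator, which still has to
read every row in the scanned range on the server side.

```python
from bisect import bisect_left

# Toy model of a table kept sorted by row key: (row_key, message_body).
# The row keys are hypothetical timestamp strings.
ROWS = sorted([
    ("20120910-162401", "disk quota exceeded on /var"),
    ("20120910-162455", "user login succeeded"),
    ("20120910-162458", "connection timeout to db01"),
    ("20120910-163002", "retry after timeout succeeded"),
])
KEYS = [key for key, _ in ROWS]


def range_scan(start_key, stop_key):
    """Row-key range lookup: binary search locates the bounds, so only
    rows inside [start_key, stop_key) are read."""
    lo = bisect_left(KEYS, start_key)
    hi = bisect_left(KEYS, stop_key)
    return ROWS[lo:hi]


def substring_scan(needle):
    """Substring match on the message body: nothing is indexed on the
    body, so every row has to be examined."""
    examined = 0
    hits = []
    for key, body in ROWS:
        examined += 1
        if needle in body:
            hits.append(key)
    return hits, examined
```

The range scan touches only the rows between the two keys, while the
substring scan reports that it examined every row regardless of how many
matched.  Bounding the scan by a row-key range (for example, a time window)
reduces the cost, but the work inside that range is still linear.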


>    2. In my case, does it make more sense to use a proper full text search
>    engine(lucene/solr/elasticsearch) to index the message body, does that
>    sound like a better idea?
>

Often yes.  For big data especially, this is where ElasticSearch excels.


Re: Hbase filter-SubstringComparator vs full text search indexing

Posted by Otis Gospodnetic <ot...@gmail.com>.
Hello,

If you need to scan lots of log messages and process them, use HBase
(or Hive, Pig, or simply HDFS+MR).
If you need to query your data set by anything in the text of the log
message, use ElasticSearch, Solr 4.0, Sensei, or just Lucene.
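
As a rough picture of what those search engines do differently, here is a
minimal Python sketch of an inverted index over message bodies.  It is a
simplified illustration under stated assumptions, not how Lucene, Solr, or
ElasticSearch are actually implemented: the tokenizer, the sample rows, and
the names build_index and search are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical sample data: (row_key, message_body) pairs.
SAMPLE_ROWS = [
    ("20120910-162401", "disk quota exceeded on /var"),
    ("20120910-162458", "connection timeout to db01"),
    ("20120910-163002", "retry after timeout succeeded"),
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(rows):
    """Map each token to the set of row keys whose message contains it."""
    index = defaultdict(set)
    for row_key, body in rows:
        for token in tokenize(body):
            index[token].add(row_key)
    return index


def search(index, query):
    """Return the row keys whose message contains every query token."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(SAMPLE_ROWS)
matches = search(index, "timeout db01")
```

A lookup costs one dictionary access per query term instead of a pass over
every message, and the matching row keys can then be fetched from HBase
directly.  Note that this matches whole tokens, not arbitrary substrings: a
query for "time" would not find "timeout" here.  True substring matching
usually needs something extra on the search-engine side, such as n-gram
indexing or wildcard queries.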

Otis
-- 
Search Analytics - http://sematext.com/search-analytics/index.html
Performance Monitoring - http://sematext.com/spm/index.html

