Posted to user@hbase.apache.org by James Pettyjohn <ja...@scientology.net> on 2013/12/03 02:21:55 UTC

Makes search indexes

Hi, general strategy and schemata approach question.

I've got a lot of different data in a relational db I'm trying to make
searchable. One example is searching for people by email address. The
address might be in any of 6 tables, with 10s of millions of records,
and none of it is standardized: it's mixed case, and a field may hold
multiple emails or something which isn't an email address at all.
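
To make the cleanup problem concrete, here is a minimal sketch of the
normalization step, not a definitive implementation. It assumes source rows
arrive as (table, pk, raw_field) tuples and that "standardized" means
lower-cased, de-duplicated, email-like tokens. The names EMAIL_RE,
extract_emails and build_index, the sample table names, and the loose regex
are all hypothetical choices made for illustration.

```python
import re
from collections import defaultdict

# Deliberately loose pattern: an assumption for illustration,
# not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def extract_emails(raw):
    """Return the distinct, lower-cased email-like tokens in a free-text field."""
    if not raw:
        return []
    found = []
    for match in EMAIL_RE.findall(raw):
        email = match.lower()
        if email not in found:
            found.append(email)
    return found


def build_index(rows):
    """Map each normalized email to the set of (table, pk) pairs it came from."""
    index = defaultdict(set)
    for table, pk, raw in rows:
        for email in extract_emails(raw):
            index[email].add((table, pk))
    return index


# Hypothetical sample rows: mixed case, two emails in one field, and a non-email.
rows = [
    ("customers", 1, "Bob@Example.COM; bob.smith@example.org"),
    ("contacts", 7, "call after 5pm"),
    ("orders", 42, "BOB@example.com"),
]
index = build_index(rows)
```

In an HBase setting the normalized address would be a natural candidate for
the row key of an index table, with one column per (table, pk) source. That
mapping is an assumption here, not something the thread specifies.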

Doing that as a one-off isn't too bad, but the data will keep being added
to, and PKs will get phased out or split into multiple PKs, etc. I also
want this for a number of other fields, which will need different
transformations applied to the data and come from their own sets of
tables.

I could do this a number of ways but I'm not satisfied with any of them,
and I'd expect that such a generic problem already has tools somewhat
suited to it.

The best tools for this may not be HBase but I'd like to
put my HBase cluster to work on this and have it available to
MR jobs.

Best, James

Re: Makes search indexes

Posted by Wukang Lin <vb...@gmail.com>.

Hi James,
  This seems to be a problem of searching non-standardized documents. I think
Solr (or something like it) may meet your requirements.
  Good luck.

