Posted to user@hbase.apache.org by Stuti Awasthi <st...@hcl.com> on 2011/09/29 11:52:14 UTC

Hbase - Solr Integration

Hi Friends,

I am storing my data in HBase and want to search it using Solr. I can't find much documentation about the integration. Is there any documentation on integrating the two?

Please suggest.

Regards,
Stuti Awasthi

::DISCLAIMER::
-----------------------------------------------------------------------------------------------------------------------

The contents of this e-mail and any attachment(s) are confidential and intended for the named recipient(s) only.
It shall not attach any liability on the originator or HCL or its affiliates. Any views or opinions presented in
this email are solely those of the author and may not necessarily reflect the opinions of HCL or its affiliates.
Any form of reproduction, dissemination, copying, disclosure, modification, distribution and / or publication of
this message without the prior written consent of the author of this e-mail is strictly prohibited. If you have
received this email in error please delete it and notify the sender immediately. Before opening any mail and
attachments please check them for viruses and defect.

-----------------------------------------------------------------------------------------------------------------------

RE: Hbase - Solr Integration

Posted by Andrew Hu <an...@live.com>.
Thanks Drew for your suggestions and ideas, very helpful.

-Andrew

> Date: Fri, 30 Sep 2011 10:17:50 -0400
> Subject: Re: Hbase - Solr Integration
> From: drew.dahlke@bronto.com
> To: user@hbase.apache.org

Re: Hbase - Solr Integration

Posted by Drew Dahlke <dr...@bronto.com>.
Hi David,

I did a little proof of concept a few weeks ago, indexing hundreds of
millions of rows from HBase in Solr using the near-real-time (NRT)
support in Solr's trunk.

You *could* write MapReduce jobs against HBase to generate Lucene
indexes on a periodic basis if you want, but that's not going to be
real time in the least. If that interests you, take a peek at the
source code for Katta.

Like you, I wanted updates to be indexed in near real time. At the
time of writing, there is no point release of Solr that includes the
near-real-time code that came out of Twitter. It's been merged into
trunk and is actually quite stable. Check out trunk, compile it, and
then configure the near-real-time settings. Trunk introduces the
concept of 'soft commits', which make new documents visible to
searches in near real time without all the overhead of flushing to
disk (a hard commit). In my case, I set it to soft commit
automatically once a second and hard commit once an hour.
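
For reference, both commit types can also be requested explicitly
through the update handler. This is a minimal sketch, assuming a Solr
build whose /update handler accepts the `softCommit` and `commit`
request parameters. The base URL and the core name "docs" are
placeholders, and `commit_url` is a hypothetical helper. Automatic
intervals like the ones described above would normally be set in
Solr's own configuration rather than from client code.

```python
from urllib.parse import urlencode

# Intervals from the message above, in milliseconds.
SOFT_COMMIT_INTERVAL_MS = 1000            # once a second
HARD_COMMIT_INTERVAL_MS = 60 * 60 * 1000  # once an hour


def commit_url(base_url, core, soft):
    """Build an explicit commit URL for one core's update handler.

    soft=True asks for a soft commit (new documents become searchable),
    soft=False asks for a hard commit (index flushed to disk).
    """
    params = {"softCommit": "true"} if soft else {"commit": "true"}
    return f"{base_url}/{core}/update?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder host and core name; adjust for a real installation.
    print(commit_url("http://localhost:8983/solr", "docs", soft=True))
    print(commit_url("http://localhost:8983/solr", "docs", soft=False))
```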

There's nothing HBase-specific about my test. I just added some code
that CCs Solr, through Solr's REST API, on every write I do to HBase.
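
The "CC Solr on writes" idea could be sketched roughly as follows.
This is an illustration, not the code used in the test: `write_row`,
`put_to_hbase`, and `send` are hypothetical names, the HBase write is
whatever client call you already have, and the request assumes a Solr
update handler that accepts a JSON array of documents.

```python
import json
from urllib import request


def solr_add_request(solr_update_url, doc):
    """Build (but do not send) a JSON add request for one document."""
    body = json.dumps([doc]).encode("utf-8")
    return request.Request(
        solr_update_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def write_row(put_to_hbase, send, solr_update_url, row_key, columns):
    """Write the row to HBase first, then copy the same data to Solr.

    put_to_hbase(row_key, columns) is a stand-in for your HBase client;
    send(req) delivers the request, e.g. urllib.request.urlopen.
    """
    put_to_hbase(row_key, columns)
    doc = {"id": row_key, **columns}  # the row key doubles as the Solr id
    send(solr_add_request(solr_update_url, doc))
    return doc
```

Note that if the Solr call fails after the HBase write succeeds, the
two stores disagree until the row is re-sent, so a real version needs
some retry or reconciliation step.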

Each document in my test was quite small (<1 KB). I had one EC2 large
instance running Solr and an HBase row scanner iterating over a
table, posting documents to Solr as fast as it could. When the index
was small, the indexing speed was a jaw-dropping 3500 document
additions/sec. By the time the index had grown to ~50 million
documents, it had tapered off to 800/sec. The key to keeping things
fast is to keep individual indexes small. Solr's answer to this is
running multiple 'cores'. It's basically a REST API for sharding your
Solr index. Maybe you shard it one core per customer? When querying,
you can specify multiple cores to execute the query against, run
multiple cores on one machine, etc.
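
A query across several cores might be built like this. It is a sketch
under assumptions: it relies on Solr's distributed-search `shards`
request parameter, and the host, the "customer_<id>" core naming
scheme, and the helper names are all made up for illustration.

```python
from urllib.parse import urlencode


def core_for_customer(customer_id):
    """Hypothetical naming scheme: one core per customer."""
    return f"customer_{customer_id}"


def shards_param(host, cores):
    """Comma-separated list of core addresses for the 'shards' parameter."""
    return ",".join(f"{host}/solr/{core}" for core in cores)


def multi_core_query_url(host, query_core, target_cores, q):
    """Build a select URL that fans the query out to several cores."""
    params = urlencode({"q": q, "shards": shards_param(host, target_cores)})
    return f"http://{host}/solr/{query_core}/select?{params}"


if __name__ == "__main__":
    cores = [core_for_customer(1), core_for_customer(2)]
    print(multi_core_query_url("localhost:8983", cores[0], cores, "title:hbase"))
```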

I realize sharding Solr to match the scalability of a distributed
database probably doesn't sound very magical. It's a lot of legwork,
and that's exactly what's motivating projects like Elastic Search and
Lucandra. I experimented with both, and sadly those experiments went
poorly compared to traditional Solr.

Hope that helps,
Drew

On Thu, Sep 29, 2011 at 6:37 PM, Andrew Hu <an...@live.com> wrote:

RE: Hbase - Solr Integration

Posted by Andrew Hu <an...@live.com>.
Hi David,

I am currently working with an HBase table that has 100 columns. My
requirement is to perform real-time search on HBase using row keys and
these many columns (all within one column family in the schema). A
typical query is SQL-like, with AND, OR, and NOT operators over these
columns. I have ruled out batch processing, such as Hive. My question
is which of these two approaches is better:

- HBase + Solr will probably give better query speed, but you need to
  maintain both clusters, push data from HBase to Solr, and perhaps
  update the Solr index pretty frequently.
- Using HBase only, where search needs to run against all of these
  columns, you need to build a secondary index for each column (if
  the master table has 1 million rows, you will end up with 100
  million index rows plus the 1 million rows of the original master
  table, which will use quite a lot of space), but I suppose search
  can be done pretty fast as well?
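
The row-count arithmetic in the second option can be checked with a
small sketch. It assumes the worst case, where every master row has a
value in every indexed column, and the index row key layout (column,
value, master row key, separated by NUL bytes) is a made-up example
rather than an established HBase convention.

```python
def index_row_count(master_rows, indexed_columns):
    """One index row per (master row, indexed column) pair."""
    return master_rows * indexed_columns


def index_row_key(column, value, master_row_key):
    """Hypothetical index key: lets a prefix scan on column+value
    find the master row keys that hold that value."""
    return f"{column}\x00{value}\x00{master_row_key}".encode("utf-8")


if __name__ == "__main__":
    master = 1_000_000
    index = index_row_count(master, 100)
    print(index)           # index rows
    print(index + master)  # index rows plus the master table
    print(index_row_key("city", "Paris", "row42"))
```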

Not sure what the best approach is. Any suggestions?


Thanks

-Andrew

> From: buttler1@llnl.gov
> To: user@hbase.apache.org
> Date: Thu, 29 Sep 2011 08:38:12 -0700
> Subject: RE: Hbase - Solr Integration

RE: Hbase - Solr Integration

Posted by "Buttler, David" <bu...@llnl.gov>.
It sounds like you should investigate the Lily Project. They have already done a lot of work to integrate Solr and HBase into a single solution. I did something similar before they released their project -- I like my use of dynamic schemas, but their overall approach is probably more solid. In particular, they have given careful consideration to what to do with large objects and how to integrate them into the system. And most importantly, their project is open.

There was also some talk earlier of integrating HBase and Solr -- you might want to search the list for some of Jason's posts. I think that is still a work in progress.

Otherwise you will have to roll your own solution. It is actually not too difficult to set up a system to publish HBase contents to Solr. The difficulty is in maintaining a consistent view of the data between the two. I believe Lily uses queues to keep updates in sync. If you can tolerate some delay, you could simply update your indexes on a regular basis, or set up your application to populate HBase and Solr simultaneously. The biggest challenge is resharding. HBase will automatically split regions when they become too large. Solr doesn't have that capability yet, so you will have to manage the shards yourself.
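
A queue-based approach to keeping the index in sync could look roughly like the sketch below. It is only an illustration of the general idea, not how Lily works: the function names are hypothetical, `index_batch` stands in for whatever call posts documents to Solr, and an in-process queue is used for brevity where a real system would want a durable one so that updates survive a crash.

```python
import queue


def record_update(updates, row_key, columns):
    """Call this alongside every HBase write to remember what changed."""
    updates.put((row_key, columns))


def drain_to_index(updates, index_batch, batch_size=100):
    """Send all queued updates to the index in batches.

    index_batch(docs) is a stand-in for the call that posts a list of
    documents to Solr. Returns the number of documents sent.
    """
    sent = 0
    batch = []
    while True:
        try:
            row_key, columns = updates.get_nowait()
        except queue.Empty:
            break
        batch.append({"id": row_key, **columns})
        if len(batch) >= batch_size:
            index_batch(batch)
            sent += len(batch)
            batch = []
    if batch:  # flush the final partial batch
        index_batch(batch)
        sent += len(batch)
    return sent
```

Running `drain_to_index` on a timer gives the "update your indexes on a regular basis" variant; running it continuously in a worker thread moves closer to near-real-time indexing.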

Another approach is to look at Elastic Search. That is a Lucene-based system that does automatic sharding.

Direct search on HBase requires a clever key encoding (like OpenTSDB) and/or multiple copies of the data to imitate secondary indexes.
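
To illustrate what a "clever key encoding" buys you, here is a minimal sketch of a composite row key. It is a made-up layout in the spirit of time-series keys, not OpenTSDB's actual schema: a numeric id and a Unix timestamp are packed big-endian so that the byte order of the keys matches numeric order, which turns a time-range query for one id into a single contiguous scan.

```python
import struct


def row_key(metric_id, timestamp):
    """Pack a metric id and a Unix timestamp into a sortable 8-byte key.

    Big-endian unsigned integers compare byte-by-byte in the same
    order as the numbers themselves.
    """
    return struct.pack(">II", metric_id, timestamp)


def scan_range(metric_id, start_ts, end_ts):
    """Return (start_row, stop_row) covering [start_ts, end_ts) for one id."""
    return row_key(metric_id, start_ts), row_key(metric_id, end_ts)


if __name__ == "__main__":
    start, stop = scan_range(7, 1000, 2000)
    # Every key for id 7 with a timestamp in the range sorts between them.
    print(start <= row_key(7, 1500) < stop)
```

The trade-off is that only queries matching the key order are fast; searching by any other column still needs a second copy of the data keyed differently.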

Dave


