Posted to solr-user@lucene.apache.org by Tim Patton <tp...@dealcatcher.com> on 2007/03/05 16:47:36 UTC
Re: Federated Search
Venkatesh Seetharam wrote:
> Hi Tim,
>
> Howdy. I saw your post on the Solr newsgroup and it caught my attention. I'm
> working on a similar problem for searching a vault of over 100 million
> XML documents. I already have the encoding part done using Hadoop and
> Lucene. It works like a charm. I create N index partitions and have
> been trying to wrap Solr to search each partition, with a search broker
> that merges the results and returns them.
>
> I'm curious about how you have solved the distribution of additions,
> deletions, and updates to each of the indexing servers. I use a
> partitioner based on a hash of the document ID. Do you broadcast to the
> slaves as to who owns a document?
>
> Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com
> <http://www.zeroc.com>) for distributing the search across these Solr
> servers. I'm not using HTTP.
>
> Any ideas are greatly appreciated.
>
> PS: I did subscribe to the Solr newsgroup just now but did not receive a
> confirmation, hence I am sending this to you directly.
>
> --
> Thanks,
> Venkatesh
>
> "Perfection (in design) is achieved not when there is nothing more to
> add, but rather when there is nothing more to take away."
> - Antoine de Saint-Exupéry
I used a SQL database to keep track of which server had which document.
Then I originally used JMS, with a selector for the server number the
document should go to. I switched over to a home-grown, lightweight
message server, since JMS behaves really badly when it backs up and I
couldn't find a server that would simply pause the producers if
there was a problem with the consumers. Additions are pretty much
assigned randomly to whichever server gets them first. At this point I
am up to around 20 million documents.
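The "pause the producers" behavior Tim couldn't find in a JMS broker is essentially backpressure. Within a single process it can be sketched with a bounded blocking queue; names here are illustrative, not taken from Tim's system:

```python
import queue
import threading

# Bounded queue: put() blocks once consumers fall behind, which pauses
# the producers instead of letting the message server back up.
work = queue.Queue(maxsize=1000)

def producer(doc_ids):
    for doc_id in doc_ids:
        work.put(doc_id)            # blocks while the queue is full

def consumer(index_one_doc):
    while True:
        doc_id = work.get()
        if doc_id is None:          # sentinel value: shut down cleanly
            break
        index_one_doc(doc_id)
```

A standalone message server would expose the same semantics over the wire; the point is only that the producer call blocks rather than failing or buffering without bound.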
The hash idea sounds really interesting and if I had a fixed number of
indexes it would be perfect. But I don't know how big the index will
grow and I wanted to be able to add servers at any point. I would like
to eliminate any outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.
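The tension between hash partitioning and adding servers later is exactly what consistent hashing addresses: adding a node only remaps the documents that fall between it and its neighbor on the ring, not the whole corpus. A minimal sketch, with made-up server names and MD5 as an arbitrary hash choice:

```python
import bisect
import hashlib

# Consistent-hashing ring: each server gets many "virtual node" points
# so load stays balanced; a document belongs to the first point
# clockwise from its own hash.
class HashRing:
    def __init__(self, nodes=(), replicas=100):
        self.replicas = replicas   # virtual nodes per server
        self._keys = []            # sorted hash points on the ring
        self._owners = {}          # hash point -> server name
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):
            point = self._hash(f"{node}:{i}")
            bisect.insort(self._keys, point)
            self._owners[point] = node

    def owner(self, doc_id):
        # First ring point clockwise from the document's hash.
        point = self._hash(doc_id)
        idx = bisect.bisect(self._keys, point) % len(self._keys)
        return self._owners[self._keys[idx]]
```

With this, `HashRing(["indexer1", "indexer2"]).owner("doc-42")` names the responsible server, and adding a third server moves only roughly a third of the documents instead of nearly all of them.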
How did you work around not being able to update a Lucene index that is
stored in Hadoop? I know there were changes in Lucene 2.1 to support
this, but I haven't looked that far into it yet; I've just been testing
the new IndexWriter. As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlies).
Tim
Re: SPAM-LOW: Re: Federated Search
Posted by Tim Patton <tp...@dealcatcher.com>.
I have several indexes now (4 at the moment, 20GB each, and I want to be
able to drop in a new machine easily). I'm using SQL Server as the DB and
it scales well. The DB doesn't get hit too hard, mostly doing location
lookups, and the app does some checking to make sure a document has
really changed before updating it in the DB or the index. When a
new server is added it randomly picks up additions from the message
server (it's approximately round-robin) and the rest of the system
really doesn't even need to know about it.
I've realized partitioned indexing is a difficult but solvable problem.
It could be a big project, though. I mean, we have all solved it in our
own way, but no one has a general solution. Distributed searching might
be a better area to add to Solr, since that should basically be the same
for everyone. I'm going to mess around with Jini on my own indexes;
there's finally a new book out to go with the newer versions.
How were you planning on using Solr with Hadoop? Maybe I don't fully
understand how Hadoop works.
Tim
Venkatesh Seetharam wrote:
> Hi Tim,
>
> Thanks for your response. Interesting idea. Does the DB scale? Do you have
> one single index which you plan to use Solr for or you have multiple
> indexes?
>
>> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.
> I'm thinking of having N partitions with a max of 10 million documents per
> partition. Adding a server should not be a problem, but the newly added
> server would take time to grow so that the distribution of documents is equal
> across the cluster. I've tested with 50 million documents of 10 size each and it
> looks very promising.
>
>> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.
> I'm in fact looking for a reverse-hash algorithm where, given a
> docId, I should be able to find which partition contains the document, so I
> can save cycles on broadcasting to slaves.
>
> I mean, even if you use a DB, how have you solved the problem of
> distribution when a new server is added into the mix.
>
> We have the same problem since we get daily updates to documents and
> document metadata.
>
>> How did you work around not being able to update a lucene index that is
> stored in Hadoop?
> I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
> and hence did not need any change to Lucene.
>
> I plan to index using Lucene/Hadoop and use Solr as the partition searcher
> and a broker which would merge the results and return 'em.
>
> Thanks,
> Venkatesh
Re: Federated Search
Posted by Venkatesh Seetharam <vs...@gmail.com>.
Hi Jed,
Thanks for sharing your thoughts and the link.
Venkatesh
Re: Federated Search
Posted by Jed Reynolds <li...@benrey.is-a-geek.net>.
Venkatesh Seetharam wrote:
>
>> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.
> I'm in fact looking for a reverse-hash algorithm where, given a
> docId, I should be able to find which partition contains the document,
> so I can save cycles on broadcasting to slaves.
Many large databases partition their data either by load or by another
logical scheme, such as by alphabet. I hear that Hotmail, for instance,
partitions its users alphabetically. Having a broker will certainly
abstract this mechanism, and of course your application(s) will want to be
able to bypass the broker when necessary.
> I mean, even if you use a DB, how have you solved the problem of
> distribution when a new server is added into the mix.
http://www8.org/w8-papers/2a-webserver/caching/paper2.html
I saw this link on the memcached list, and the thread surrounding it
certainly covered some similar ground. Some of the ideas discussed:
- high availability of memcached, redundant entries
- scaling out clusters and facing the need to rebuild the entire cache
on all nodes, depending on your bucketing.
I see some similarities between maintaining multiple indices/Lucene
partitions and running a memcached deployment: if you are hashing
your keys to partitions (or buckets, or machines) then you might be faced
with a) availability issues if there's a machine/partition outage, and b)
rebuilding partitions if adding a partition/bucket changes the hash mapping.
The ways I can think of to scale out new indexes would be, first, to have your
application maintain two sets of bucket mappings from ids to indexes, and
second, to key your documents and partition them by date.
The former method would allow you to rebuild a second set of
repartitioned indexes and buckets, then update your
application to use the new bucket mapping (once all the indexes have been
rebuilt). The latter method would only apply if you could organize your
document ids by date and only added new documents at the 'now' end, or
evenly across most dates. You'd have to add a new partition onto the end
as time progressed, and rarely rebuild old indexes unless your documents
grow unevenly.
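The date-keyed scheme above amounts to a pure routing function from document date to partition name; a sketch, where quarterly granularity and the index naming are arbitrary choices for illustration:

```python
from datetime import date

def partition_for(doc_date: date) -> str:
    # New partitions appear only at the "now" end of the timeline, so
    # old indexes never need rebuilding as capacity is added.
    quarter = (doc_date.month - 1) // 3 + 1
    return f"index-{doc_date.year}-q{quarter}"
```

Because the mapping never changes for past dates, adding next quarter's partition moves no existing documents, which is the property the dual-bucket-mapping approach has to work much harder to get.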
Interesting topic! I don't yet need to run multiple Lucene partitions,
but I have a few memcached servers, and I expect that increasing their
number will force my site to take a performance hit as I am
forced to rebuild the caches. I can see, similarly, that if I had multiple
Lucene partitions and had to split some of them, rebuilding the
resulting partitions would be time-intensive, and I'd want to have
procedures in place for availability, scaling out, and changing
application code as necessary. Just having one fail-over Solr index is
just so easy in comparison.
Jed
Re: Federated Search
Posted by Venkatesh Seetharam <vs...@gmail.com>.
Hi Tim,
Thanks for your response. Interesting idea. Does the DB scale? Do you have
one single index which you plan to use Solr for, or do you have multiple
indexes?
> But I don't know how big the index will grow and I wanted to be able to
add servers at any point.
I'm thinking of having N partitions with a max of 10 million documents per
partition. Adding a server should not be a problem, but the newly added
server would take time to grow so that the distribution of documents is equal
across the cluster. I've tested with 50 million documents of 10 size each and it
looks very promising.
> The hash idea sounds really interesting and if I had a fixed number of
indexes it would be perfect.
I'm in fact looking for a reverse-hash algorithm where, given a
docId, I should be able to find which partition contains the document, so I
can save cycles on broadcasting to slaves.
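Strictly speaking, no reverse hash is needed if the partitioner is deterministic: the broker can recompute the same forward hash from the docId and go straight to the owning partition, with no broadcast. A sketch, where the fixed partition count is exactly the limitation Tim raised earlier in the thread:

```python
import hashlib

NUM_PARTITIONS = 8  # fixed; changing this remaps almost every document

def partition_of(doc_id: str) -> int:
    # The same function runs on the indexer and on the broker, so the
    # "reverse lookup" is just recomputing the forward hash.
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS
```

The indexer calls `partition_of` to route an add or delete, and the broker calls it to target a single slave for an update, which is the cycle-saving Venkatesh is after.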
I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?
We have the same problem since we get daily updates to documents and
document metadata.
> How did you work around not being able to update a lucene index that is
stored in Hadoop?
I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
and hence did not need any change to Lucene.
I plan to index using Lucene/Hadoop and use Solr as the partition searcher
and a broker which would merge the results and return 'em.
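The broker's merge step Venkatesh describes can be sketched as a k-way merge of per-partition top-k lists. This assumes scores are comparable across partitions (consistent global IDF), which a real distributed Lucene deployment has to arrange separately:

```python
import heapq
import itertools

def merge_results(per_partition_hits, k=10):
    """Merge per-partition (doc_id, score) lists, each already sorted
    by descending score, into a single global top-k."""
    merged = heapq.merge(*per_partition_hits, key=lambda hit: -hit[1])
    return list(itertools.islice(merged, k))
```

Each partition only ever ships its own top k hits to the broker, so the merge cost depends on the number of partitions, not the total index size.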
Thanks,
Venkatesh
Re: Re[2]: Federated Search
Posted by Venkatesh Seetharam <vs...@gmail.com>.
Hi Jack,
Howdy. Comments are inline.
> is there any reason you don't want to use HTTP?
I've seen that Hadoop RPC is faster than HTTP. Also, since Solr returns an XML
response, you incur overhead in parsing and then merging it. I haven't done
scale testing with HTTP and XML responses.
> Do the searchers need to know who has what document?
This is necessary if you are doing updates to the document in the index.
> I suppose solr is ok to handle 20 million document
Storage is not an issue. If the index is huge, searches will take
time, and when you want 100 searches/second, that's really hard. I've read on the
Lucene newsgroup that Lucene works well with an index around 8-10GB, and
slows down when it's bigger than that. Since my index can run into many GBs,
I'd partition it.
> - If a hash value-based partitioning is used, re-indexing all the
document will be needed.
Why is that necessary? If a document has to be updated, you can broadcast to the
slaves as to who owns it, and then send an update to that slave.
Venkatesh
Re[2]: Federated Search
Posted by Tim Patton <tp...@dealcatcher.com>.
Jack,
My big stumbling blocks were with indexing more so than searching. I
did put together an RMI-based system to search multiple Lucene servers,
and the searchers don't need to know where everything is. However,
with indexing, at some point something needs to know where to send the
documents for updating, or whom to tell to delete a document, whether that
is the server that does the processing or some sort of broker. The
processing machines could do the DB lookup and talk to Solr over HTTP,
no problem, and this is part of what I am considering doing. However, I
have some extra code on the indexing machines to handle DB updates,
etc., though I might find a way to move this elsewhere in the system
so I can have pretty much a pure Solr server with just a few custom
items (like my own Similarity or QueryParser).
I suppose the DB could be moved from SQL to Lucene in the future as well.
Re[2]: Federated Search
Posted by Jack L <jl...@yahoo.ca>.
This is a very interesting discussion. I have a few questions while
reading Tim and Venkatesh's emails:
To Tim:
1. Is there any reason you don't want to use HTTP? Since Solr has
an HTTP interface already, I suppose using HTTP is the simplest
way to communicate with the Solr servers from the merger/search broker.
Hadoop and ICE would both require some additional work - this is
if you are using Solr and not Lucene directly.
2. "Do you broadcast to the slaves as to who owns a document?"
Do the searchers need to know who has what document?
To Venkatesh:
1. I suppose Solr is OK to handle 20 million documents - I hope I'm
right, because that's what I'm planning on doing :) Is it because
of storage capacity that you chose to use multiple Solr
servers?
An open question: what's the best way to manage server addition?
- If hash value-based partitioning is used, re-indexing all
the documents will be needed.
- Otherwise, a database seems to be required to track the documents.
--
Best regards,
Jack