Posted to solr-user@lucene.apache.org by Zhenyu Zhong <zh...@gmail.com> on 2009/09/02 19:14:50 UTC

questions about solr

Dear all,

I am very interested in Solr and would like to deploy Solr for distributed
indexing and searching. I hope you are the right Solr expert who can help me
out.
However, I have concerns about the scalability and management overhead of
Solr. I am wondering if anyone could give me some guidance on Solr.

Basically, I have the following questions:
For indexing
1.  How does Solr handle distributed indexing? It seems Solr generates the
index on a single box. What if the index is huge and can't fit on one box?
2.  Is it possible for Solr to generate an index in HDFS?

For searching
3.  Solr provides a Master/Slave framework. How does Solr distribute the
search? Does Solr know which index/shard to deliver the query to? Or does it
have to multicast the query to all the nodes?

For fault tolerance
4. Does Solr handle the management overhead automatically? Suppose the master
goes down: how does Solr recover the master in order to get the latest index
updates?
    Do we have to write code ourselves to handle this?
5. Suppose the master goes down immediately after an index update, before the
update has been replicated to the slaves. Data loss seems likely.
Does Solr have any mechanism to deal with that?

Performance of real-time index updating
6. How is the performance of real-time index updating? Suppose we
frequently update a million records in a huge index with billions of
records. Can Solr provide reasonable performance and low latency for
that? (This is probably related to the Lucene library.)




I would very much appreciate any guidance you can give us.

Best,
edward

Re: questions about solr

Posted by Shalin Shekhar Mangar <sh...@gmail.com>.

On Wed, Sep 2, 2009 at 10:44 PM, Zhenyu Zhong <zh...@gmail.com> wrote:

> Dear all,
>
> I am very interested in Solr and would like to deploy Solr for distributed
> indexing and searching. I hope you are the right Solr expert who can help me
> out.
> However, I have concerns about the scalability and management overhead of
> Solr. I am wondering if anyone could give me some guidance on Solr.
>
> Basically, I have the following questions:
> For indexing
> 1.  How does Solr handle distributed indexing? It seems Solr generates the
> index on a single box. What if the index is huge and can't fit on one box?
>

Solr leaves the distribution of the index up to the user. So if you think your
index will not fit on one box, you figure out a sharding strategy (such as
hashing or round-robin) and index your collection into the shards.
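
To make the hashing option concrete, here is a minimal Python sketch of
client-side routing. The shard URLs and function names are made up for
illustration; Solr does not do this for you, so the indexing client would
run something like it before posting each batch.

```python
import zlib

# Hypothetical shard URLs; replace with your own Solr instances.
SHARDS = [
    "http://solr1:8983/solr",
    "http://solr2:8983/solr",
    "http://solr3:8983/solr",
]

def shard_for(doc_id, shards=SHARDS):
    """Pick a shard by hashing the document's unique key."""
    # crc32 is stable across processes, unlike Python's built-in hash().
    bucket = zlib.crc32(doc_id.encode("utf-8")) % len(shards)
    return shards[bucket]

def partition(docs, shards=SHARDS):
    """Group documents (dicts with an "id" field) by target shard."""
    batches = {url: [] for url in shards}
    for doc in docs:
        batches[shard_for(doc["id"], shards)].append(doc)
    return batches
```

Because the same id always hashes to the same shard, re-indexing a document
overwrites its earlier copy instead of creating a duplicate on another shard.
The catch is that changing the number of shards changes the mapping, so plan
on a full reindex if you add shards later.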

Solr supports distributed search so that your query can use all the shards
to give you the results.
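
As a rough illustration, a distributed query is an ordinary select request
with a "shards" parameter listing the shards to consult. The sketch below
only builds such a URL; the host names are placeholders and the helper
function is hypothetical, not part of any Solr client library.

```python
from urllib.parse import urlencode

def distributed_query_url(entry_point, shards, query, rows=10):
    """Build a select URL that asks one Solr instance to query every shard."""
    params = {
        "q": query,
        "rows": rows,
        # Comma-separated host:port/path entries, without the http:// prefix.
        "shards": ",".join(shards),
    }
    return entry_point + "/select?" + urlencode(params)

url = distributed_query_url(
    "http://solr1:8983/solr",                 # placeholder entry point
    ["solr1:8983/solr", "solr2:8983/solr"],   # placeholder shards
    "title:lucene",
)
print(url)
```

Any of the shards can act as the entry point; it forwards the query to the
others and merges what comes back.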


> 2.  Is it possible for Solr to generate an index in HDFS?
>
>
I have never tried it, but it seems so. See Jason's response and the Jira
issue he mentions.


> For searching
> 3.  Solr provides a Master/Slave framework. How does Solr distribute the
> search? Does Solr know which index/shard to deliver the query to? Or does it
> have to multicast the query to all the nodes?
>
>
For a full-text search it is hard to figure out the correct shards because a
matching document could live anywhere (unless your data lends itself to a
sharding scheme that maps queries to specific shards). Each shard is queried,
and the results are merged and returned as if you had queried a single Solr
server.
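
Conceptually, the merge step looks something like the Python sketch below.
It only illustrates combining per-shard result lists by score and is not
Solr's actual implementation; the hit dictionaries and scores are made up.

```python
import heapq
from itertools import islice

def merge_shard_results(per_shard_hits, rows=10):
    """Merge hit lists that are each already sorted by descending score."""
    merged = heapq.merge(
        *per_shard_hits,
        key=lambda hit: hit["score"],
        reverse=True,  # inputs are sorted highest score first
    )
    # Keep only the top `rows` hits overall.
    return list(islice(merged, rows))

shard_a = [{"id": "a1", "score": 3.2}, {"id": "a2", "score": 1.1}]
shard_b = [{"id": "b1", "score": 2.7}, {"id": "b2", "score": 0.4}]
top = merge_shard_results([shard_a, shard_b], rows=3)
print([hit["id"] for hit in top])
```

Each shard only needs to return its own top hits, so the merge stays cheap
even when the shards themselves are large.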


> For fault tolerance
> 4. Does Solr handle the management overhead automatically? Suppose the master
> goes down: how does Solr recover the master in order to get the latest index
> updates?
>    Do we have to write code ourselves to handle this?
>

It does not. You have to handle that yourself currently. Similar topics have
been discussed on this list in the past and some workarounds have been
suggested. I suggest you search the archives.


> 5. Suppose the master goes down immediately after an index update, before the
> update has been replicated to the slaves. Data loss seems likely.
> Does Solr have any mechanism to deal with that?
>
>
No. If you want, you can set up a backup master and index on both the master
and the backup machine to achieve redundancy. However, switching between the
master and the backup would need to be done by you.
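
A minimal Python sketch of the "index on both" idea follows. The send
callable is hypothetical: it stands in for whatever code you use to post a
batch to one Solr instance, and is assumed to raise an exception on failure.

```python
def index_on_all(docs, masters, send):
    """Send the same batch to every master and report which ones failed.

    `send(url, docs)` is a caller-supplied (hypothetical) function that posts
    the batch to one Solr instance and raises on error.
    """
    failed = []
    for url in masters:
        try:
            send(url, docs)
        except Exception as exc:
            # Remember the failure so the batch can be replayed later.
            failed.append((url, exc))
    return failed
```

The caller has to decide what to do with the failures, for example by
queueing the batch for replay, since otherwise the two machines drift apart.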


> Performance of real-time index updating
> 6. How is the performance of real-time index updating? Suppose we
> frequently update a million records in a huge index with billions of
> records. Can Solr provide reasonable performance and low latency for
> that? (This is probably related to the Lucene library.)
>
>
How frequently? With careful sharding, you can distribute your write load.
Depending on your data, you may also be able to split your index into a
more frequently updated one and an older archive index.
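
One way to picture the recent/archive split is to route each document by its
age. This is only a sketch: the 30-day cutoff and the index names are
assumptions chosen for the example, and the right boundary depends on your
data.

```python
from datetime import datetime, timedelta

# Assumed cutoff: documents newer than 30 days go to the small "recent" index.
HOT_WINDOW = timedelta(days=30)

def target_index(doc_time, now, hot_window=HOT_WINDOW):
    """Return which index a document belongs in, based on its age."""
    return "recent" if now - doc_time <= hot_window else "archive"

now = datetime(2009, 9, 2)
print(target_index(datetime(2009, 8, 20), now))
print(target_index(datetime(2008, 1, 1), now))
```

The small recent index absorbs the frequent updates, while the large archive
index changes rarely and can be optimized and replicated less often.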

A lot of work is in progress in this area. Lucene 2.9 has support for near
real-time search, with more improvements planned in the coming days. Solr 1.4
will not support these new Lucene features, but with 1.5 things should be a
lot better.

-- 
Regards,
Shalin Shekhar Mangar.

Re: questions about solr

Posted by Jason Rutherglen <ja...@gmail.com>.

For HDFS, failover, and sharding, you may want to use Solr with Katta.
There's an issue open at:
http://issues.apache.org/jira/browse/SOLR-1301

Near realtime search needs to be added incrementally to Solr.  Today I
wouldn't recommend it.
