Posted to solr-user@lucene.apache.org by Development Team <de...@gmail.com> on 2009/04/14 21:09:59 UTC

How to manage real-time (presence) data in a large index?

Hi everybody,
       I have a relatively large index (it will eventually contain ~4M
documents and be about 3 GB in size, I think) that indexes user data,
settings, and the like. The documents represent a community of users,
a subset of whom may be "online" at any time. We also want to score
results for searches that span the whole index by online (i.e. presence)
status.
       Right now the list of online members is kept in a database table,
but we very often need to search on these users. The problem is that we're
using Solr for our searches and we don't know how to approach setting up a
search system for a large amount of highly volatile data.
       How do people typically go about this? Do they do one of the
following:
             1) Set up a second core and index only the "online"
members in there? (Then we could not score normal search results by online
status.)
             2) Index the online status in our regular solr index and not
worry about it? (If it's fast to update docs in a large index, then why not
maintain real-time data in the main index?)
             3) Just use a database for the presence data and forget about
using Solr for the presence-related searches?
       Is there anything in Solr that I should be looking into to help with
this problem? I'd appreciate any help.

Sincerely,

       Daryl.

Re: How to manage real-time (presence) data in a large index?

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@gmail.com>.
On Wed, Apr 15, 2009 at 12:39 AM, Development Team <de...@gmail.com> wrote:
> Hi everybody,
>       I have a relatively large index (it will eventually contain ~4M
> documents and be about 3G in size, I think) that indexes user data,
> settings, and the like. The documents represent a community of users
> whereupon a subset of them may be "online" at any time. Also, we want to
> score our search results across searches that span the whole index by the
> online (i.e. presence) status.
>       Right now the list of online members is kept in a database table,
> however we very often need to search on these users. The problem is, we're
> using Solr for our searches and we don't know how to approach setting up a
> search system for a large amount of highly volatile data.
>       How do people typically go about this? Do they do one of the
> following:
>             1) Set up a second core and keep only index the "online"
> members in there? (Then we could not score normal search results by online
> status.)
This will not work, because creating an index is quite expensive.
>             2) Index the online status in our regular solr index and not
> worry about it? (If it's fast to update docs in a large index, then why not
> maintain real-time data in the main index?)
Do you need the data to be almost real-time? That means you will
have to commit very often, which may result in very poor performance.

>             3) Just use a database for the presence data and forget about
> using Solr for the presence-related searches?

If the number of online users is low enough to be held in a HashSet in
memory, you could implement a special field type akin to
org.apache.solr.schema.ExternalFileField. Do not expect it to be truly
real-time, but you can get close to real-time (say, refresh the HashSet
once a minute by fetching the data from the DB).
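
A minimal sketch of the periodically refreshed HashSet idea, in plain
Java. It is not Solr API: the class name PresenceCache, the loader
callback (standing in for the DB query), and the boost() helper are all
hypothetical, and wiring this into a custom field type is left out. It
only shows how the set can be swapped atomically once per interval
while searches keep reading the previous snapshot.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Hypothetical helper: an in-memory snapshot of the IDs of online users. */
class PresenceCache {
    // Assumption: the loader runs the DB query and returns the online user IDs.
    private final Supplier<Set<String>> loader;

    // Readers always see a complete snapshot; refresh() replaces it in one write.
    private volatile Set<String> online = Collections.emptySet();

    // Daemon thread so the refresher never keeps the JVM alive on its own.
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "presence-refresh");
                t.setDaemon(true);
                return t;
            });

    PresenceCache(Supplier<Set<String>> loader) {
        this.loader = loader;
    }

    /** Reload the snapshot once (copying, so later loader changes cannot leak in). */
    void refresh() {
        online = Collections.unmodifiableSet(new HashSet<>(loader.get()));
    }

    /** Refresh now and then every periodSeconds (e.g. 60 for "once a minute"). */
    void start(long periodSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                refresh();
            } catch (RuntimeException e) {
                // Keep the old snapshot; an uncaught exception would cancel the schedule.
            }
        }, 0, periodSeconds, TimeUnit.SECONDS);
    }

    void stop() {
        scheduler.shutdownNow();
    }

    boolean isOnline(String userId) {
        return online.contains(userId);
    }

    /** Hypothetical scoring helper: multiplier for online users, 1.0 otherwise. */
    double boost(String userId, double onlineBoost) {
        return isOnline(userId) ? onlineBoost : 1.0;
    }
}
```

Presence would then lag by at most one refresh interval, and the main
index would not need a commit for every status change.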

>       Is there anything in Solr that I should be looking into to help with
> this problem? I'd appreciate any help.
>
> Sincerely,
>
>       Daryl.
>



-- 
--Noble Paul