Posted to solr-user@lucene.apache.org by escher2k <es...@yahoo.com> on 2006/12/21 21:23:10 UTC

Realtime directory change...

Hi,
  We currently use Lucene to index user data every couple of hours - the
index is completely rebuilt, the old index is archived, and the new one is
copied over to the directory.
Example -

/bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
/bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
/bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
/bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
/bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help

This works fine since the index is retrieved every time from disk. Is it
possible to do the same with Solr?
Assuming we also use caching to speed up the retrieval, is there a way to
invalidate some/all caches when this is done?

Thanks.



Re: Realtime directory change...

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks Chris. So, assuming that we rebuild the index, delete the old data,
: and then execute a commit, will the snap scripts take care of reconciling
: all the data? Internally, is there an update timestamp notion used to
: figure out which unique id records have changed and then synchronize them
: by executing delete/insert ops?

Ummm... i'm not sure i understand your question. if you've got a
uniqueKey field, then it doesn't matter what kind of timestamps you have:
Solr will automatically delete the old records as you add new records with
the same id.  if you don't have a uniqueKey field, and you want to just
reindex your corpus at moment X and then say "anything older than
timestamp X should be deleted" when you are done, then you can just do a
delete by query using a range query on the date X ... having a
timestamp field that records the moment when something was indexed is
actually very easy: just include a date field with the value of "NOW"
(this will be even easier once i get around to committing SOLR-82).

bear in mind, it doesn't have to be a date field ... you could also
record a simple "build number" that you increment each time you "rebuild"
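
For illustration, a minimal sketch of that timestamp approach, assuming a
single Solr instance with the stock update handler at
http://localhost:8983/solr/update and a date field named "timestamp" in
schema.xml (the URL and the field name are assumptions, not anything from
this thread):

SOLR=http://localhost:8983/solr/update
CUTOFF=$(date -u +"%Y-%m-%dT%H:%M:%SZ")  # anything stamped before this moment is "old"
sleep 1                                  # so re-indexed docs get a strictly later stamp

# re-post every document (all of its fields), stamping each one with the
# time it was (re)indexed
STAMP=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
curl "$SOLR" -H 'Content-type: text/xml; charset=utf-8' --data-binary \
  "<add><doc><field name=\"id\">doc1</field><field name=\"timestamp\">$STAMP</field></doc></add>"

# once the whole corpus has been re-posted, sweep out everything older
curl "$SOLR" -H 'Content-type: text/xml; charset=utf-8' --data-binary \
  "<delete><query>timestamp:[* TO $CUTOFF]</query></delete>"

The "build number" variant is the same pattern with a sortable integer field
(e.g. the example schema's "sint" type) in place of the date field.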



-Hoss


Re: Realtime directory change...

Posted by escher2k <es...@yahoo.com>.
Thanks Chris. So, assuming that we rebuild the index, delete the old data,
and then execute a commit, will the snap scripts take care of reconciling
all the data? Internally, is there an update timestamp notion used to
figure out which unique id records have changed and then synchronize them
by executing delete/insert ops?


Chris Hostetter wrote:
> 
> 
> : Thanks. The problem is, it is not easy to do an incremental update on the
> : data set. In that case, I guess the index needs to be created in a
> : different path and we need to move files around. However, since the
> : documents are added over HTTP, how does one even create the index in a
> : different path on the same machine while the application is still running?
> 
> for the record, i don't think you *have* to do this ... although it will
> certainly work fine if you want to (since it's just the master/slave model
> starting with an empty index)
>
> if in your current model, you have an index which you never modify, and
> you regularly build a new index on a new path and then replace it, you
> could do the same thing with a single Solr instance by indexing all of
> your new documents on the same index, then deleting all docs older than
> your newest "rebuild" (using a timestamp field), and then and only then
> issuing a commit to tell Solr to start using the new index.
>
> as long as no one else issues a commit while you are "rebuilding", your
> index will always look consistent.
> 
> But as i said: the master/slave model will work perfectly for what you
> want as well -- and the snap* scripts will take care of loading it up on
> your slave.
> 
> 
> 
> -Hoss
> 
> 
> 



Re: Realtime directory change...

Posted by Chris Hostetter <ho...@fucit.org>.
: Thanks. The problem is, it is not easy to do an incremental update on the
: data set. In that case, I guess the index needs to be created in a
: different path and we need to move files around. However, since the
: documents are added over HTTP, how does one even create the index in a
: different path on the same machine while the application is still running?

for the record, i don't think you *have* to do this ... although it will
certainly work fine if you want to (since it's just the master/slave model
starting with an empty index)

if in your current model, you have an index which you never modify, and
you regularly build a new index on a new path and then replace it, you
could do the same thing with a single Solr instance by indexing all of
your new documents on the same index, then deleting all docs older than
your newest "rebuild" (using a timestamp field), and then and only then
issuing a commit to tell Solr to start using the new index.

as long as no one else issues a commit while you are "rebuilding", your
index will always look consistent.
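
Sketched against a hypothetical stock single-core instance (the URL is an
assumption, not something from this thread), the ordering looks like:

SOLR=http://localhost:8983/solr/update

# 1) re-post the whole corpus to the live index (adds that reuse an existing
#    uniqueKey quietly replace the older copy of that document)
# 2) delete whatever this rebuild did not touch, e.g. with a delete-by-query
#    on a timestamp or build-number field
# 3) only now make the rebuilt index visible by opening a new searcher:
curl "$SOLR" -H 'Content-type: text/xml; charset=utf-8' --data-binary '<commit/>'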

But as i said: the master/slave model will work perfectly for what you
want as well -- and the snap* scripts will take care of loading it up on
your slave.
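
In case it helps, a rough sketch of those moving parts (the script names come
from Solr's collection distribution scripts; paths, ports and scheduling are
assumptions for illustration):

# on the master: enable rsyncd once so slaves can pull snapshots
solr/bin/rsyncd-enable
solr/bin/rsyncd-start

# on the master, after each commit -- often wired up as a postCommit
# RunExecutableListener in solrconfig.xml:
solr/bin/snapshooter

# on each slave, from cron:
solr/bin/snappuller      # rsync the newest snapshot over from the master
solr/bin/snapinstaller   # install it and issue a commit so a new searcher opens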



-Hoss


Re: Realtime directory change...

Posted by escher2k <es...@yahoo.com>.
Thanks. The problem is, it is not easy to do an incremental update on the
data set. In that case, I guess the index needs to be created in a
different path and we need to move files around. However, since the
documents are added over HTTP, how does one even create the index in a
different path on the same machine while the application is still running?

Ideally, what we would want is to rebuild the index from scratch and then
use the master/slave configuration to copy the indexes to other machines.


Yonik Seeley wrote:
> 
> On 12/21/06, escher2k <es...@yahoo.com> wrote:
>> Hi,
>>   We currently use Lucene to index user data every couple of hours - the
>> index is completely rebuilt, the old index is archived, and the new one is
>> copied over to the directory.
>>
>> Example -
>>
>> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
>> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
>> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
>> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
>> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
>>
>> This works fine since the index is retrieved every time from disk. Is it
>> possible to do the same with Solr?
> 
> Yes, this will work.  This is sort of what the index distribution
> scripts do to install a new index snapshot in a master/slave
> configuration.
> 
> You also don't have to build in a different directory if you don't
> want to.  Solr supports incremental updates.
> 
>> Assuming we also use caching to speed up the retrieval, is there a way to
>> invalidate some/all caches when this is done?
> 
> It's done automatically.  You will need to issue a <commit/> to solr
> to get it to read the new index (open a new searcher), and new caches
> will be associated with that new searcher.
> 
> -Yonik
> 
> 



Re: Realtime directory change...

Posted by Yonik Seeley <yo...@apache.org>.
On 12/21/06, escher2k <es...@yahoo.com> wrote:
> Hi,
>   We currently use Lucene to index user data every couple of hours - the
> index is completely rebuilt, the old index is archived, and the new one is
> copied over to the directory.
>
> Example -
>
> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
>
> This works fine since the index is retrieved every time from disk. Is it
> possible to do the same with Solr?

Yes, this will work.  This is sort of what the index distribution
scripts do to install a new index snapshot in a master/slave
configuration.

You also don't have to build in a different directory if you don't
want to.  Solr supports incremental updates.
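
For example, an incremental pass over just the changed data could be as
small as this (the URL, field names and ids are illustrative assumptions;
it presumes a uniqueKey field called "id" as in the example schema):

# replace (or add) one changed document; the uniqueKey makes this an upsert
curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' \
  --data-binary '<add><doc><field name="id">user42</field><field name="name">Some User</field></doc></add>'

# remove a document that disappeared from the source data
curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' \
  --data-binary '<delete><id>user43</id></delete>'

# make both changes visible at once
curl http://localhost:8983/solr/update -H 'Content-type: text/xml; charset=utf-8' \
  --data-binary '<commit/>'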

> Assuming we also use caching to speed up the retrieval, is there a way to
> invalidate some/all caches when this is done?

It's done automatically.  You will need to issue a <commit/> to solr
to get it to read the new index (open a new searcher), and new caches
will be associated with that new searcher.
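
The caches in question are the per-searcher ones declared in solrconfig.xml;
the stanzas below are just the stock examples (sizes illustrative), and
autowarmCount controls how many entries from the old searcher's cache are
used to prime the new one after a commit:

<filterCache      class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>
<queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="256"/>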

-Yonik

Re: Realtime directory change...

Posted by Thorsten Scherler <th...@juntadeandalucia.es>.
On Thu, 2006-12-21 at 12:23 -0800, escher2k wrote:
> Hi,
>   We currently use Lucene to index user data every couple of hours - the
> index is completely rebuilt, the old index is archived, and the new one is
> copied over to the directory.
> Example -
> 
> /bin/cp ${LOG_FILE} ${CRON_ROOT}/index/help/
> /bin/rm -rf ${INDEX_ROOT}/archive/help.${DATE}
> /bin/cp -R ${CRON_ROOT}/index/help ${INDEX_ROOT}/help.new
> /bin/mv ${INDEX_ROOT}/help ${INDEX_ROOT}/archive/help.${DATE}
> /bin/mv ${INDEX_ROOT}/help.new ${INDEX_ROOT}/help
> 
> This works fine since the index is retrieved every time from disk. Is it
> possible to do the same with Solr?
> Assuming we also use caching to speed up the retrieval, is there a way to
> invalidate some/all caches when this is done?
> 

Did you look into 
http://wiki.apache.org/solr/CollectionDistribution
http://wiki.apache.org/solr/SolrCollectionDistributionScripts
http://wiki.apache.org/solr/SolrCollectionDistributionOperationsOutline

I am still very new to Solr, but it sounds like this is exactly what you
need (as others have said as well).

HTH

salu2


> Thanks.
>