Posted to solr-user@lucene.apache.org by Daniel Shane <sh...@LEXUM.UMontreal.CA> on 2010/02/16 20:29:30 UTC

Preventing mass index delete via DataImportHandler full-import

I've set up a simple DataImportHandler (DIH) in Solr that connects to my database.

I have a small worry though. When I call the full-import command, can I configure Solr (via the XML files) to make sure there are rows to index before wiping everything? What worries me is that if, for some unknown reason, we have an empty database, then the full-import will just wipe the live index and the search will be broken.

I don't think it's possible, but I'm new to Solr so it's quite possible I've overlooked how this could be done.

Thanks in advance for any help!
Daniel Shane

Re: Preventing mass index delete via DataImportHandler full-import

Posted by Chris Hostetter <ho...@fucit.org>.
: That's what I thought. I think I'll take the time to add something to the 
: DIH to prevent such things. Maybe a parameter that will cause the import 
: to bail out if the documents to index are fewer than X% of the total 
: number of documents already in the index.

the devil's in the details though ... to do an efficient "full-import", DIH 
deletes the index before it starts indexing anything, and for an 
arbitrary datasource with an arbitrary set of entities and sub-entities 
and various layers of logic it seems like it would be infeasible to know 
how many rows you are going to get before you actually start.

I think this sort of thing would pretty much have to be done post-import 
(w/o doing the initial delete): count the number of docs added, and if 
that number is above a percentage threshold, delete all of the docs older 
than the import (using a delete-by-query based on a timestamp field).
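
A rough sketch of that post-import check, in Python. It only builds the delete message; sending it to Solr is left out. The field name `indexed_at`, the 90% default, and both function names are made up for illustration, and it assumes the schema has a date field that is set when each doc is (re)indexed.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


def stale_delete_xml(import_started, timestamp_field="indexed_at"):
    """Build a delete-by-query message for docs indexed before this import.

    `timestamp_field` is a hypothetical field name; it must be a date field
    that gets a fresh value every time a doc is indexed.
    """
    cutoff = import_started.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    # Curly braces make the range exclusive, so docs stamped exactly at the
    # cutoff are kept.
    query = f"{timestamp_field}:{{* TO {cutoff}}}"
    return f"<delete><query>{escape(query)}</query></delete>"


def purge_message(docs_added, docs_before, import_started, min_ratio=0.9):
    """Return the delete message only if the import looks sane, else None."""
    if docs_before > 0 and docs_added < min_ratio * docs_before:
        return None  # too few docs arrived: leave the old docs in place
    return stale_delete_xml(import_started)
```

If `purge_message` returns None the old docs stay searchable and someone can go look at the database before anything is removed.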

Of course, none of this helps you with the possibility that you have 
plenty of docs, but they all contain useless data (maybe some nested 
entity query failed so you have no searchable text) ... logic for sanity 
checking an index tends to be fairly domain-specific.



-Hoss


Re: Preventing mass index delete via DataImportHandler full-import

Posted by Daniel Shane <sh...@LEXUM.UMontreal.CA>.
That's what I thought. I think I'll take the time to add something to the DIH to prevent such things. Maybe a parameter that will cause the import to bail out if the documents to index are fewer than X% of the total number of documents already in the index.

There would also be a parameter to override this manually.

I think it would be a good safety precaution.
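
To make the idea concrete, here is a minimal sketch of the check in Python. This is not an existing DIH parameter; `import_allowed`, `min_percent` and `force` are names invented here, and the caller is assumed to already know how many rows the import would produce (for a single-entity import, something like a COUNT(*) on the source table).

```python
def import_allowed(rows_to_index, docs_in_index, min_percent=50.0, force=False):
    """Decide whether a full-import should go ahead.

    rows_to_index: rows the import is expected to produce (supplied by caller)
    docs_in_index: documents currently in the live index
    min_percent:   the "X %" threshold (hypothetical default)
    force:         manual override that skips the check
    """
    if force or docs_in_index == 0:
        return True  # nothing to protect, or the operator insists
    return rows_to_index * 100.0 >= min_percent * docs_in_index
```

For example, `import_allowed(0, 120000)` is False, so an empty database would not wipe the index, while `import_allowed(0, 120000, force=True)` lets the import through.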

Daniel Shane

----- Original Message -----
From: "Noble Paul നോബിള്‍ नोब्ळ्" <no...@corp.aol.com>
To: solr-user@lucene.apache.org
Sent: Wednesday, February 17, 2010 12:36:52 AM
Subject: Re: Preventing mass index delete via DataImportHandler full-import

On Wed, Feb 17, 2010 at 8:03 AM, Chris Hostetter
<ho...@fucit.org> wrote:
>
> : I have a small worry though. When I call the full-import functions, can
> : I configure Solr (via the XML files) to make sure there are rows to
> : index before wiping everything? What worries me is if, for some unknown
> : reason, we have an empty database, then the full-import will just wipe
> : the live index and the search will be broken.
>
> I believe if you set clear=false when doing the full-import, DIH won't
it is clean=false

or use command=import instead of command=full-import
> delete the entire index before it starts.  it probably makes the
> full-import slower (most of the adds wind up being deletes followed by
> adds) but it should prevent you from having an empty index if something
> goes wrong with your DB.
>
> the big catch is you now have to be responsible for managing deletes
> (using the XmlUpdateRequestHandler) yourself ... this bug looks like its
> goal is to make this easier to deal with (but it's not really clear to
> me what "deletedPkQuery" is ... it doesn't seem to be documented).
>
> https://issues.apache.org/jira/browse/SOLR-1168
>
>
>
> -Hoss
>
>



-- 
-----------------------------------------------------
Noble Paul | Systems Architect| AOL | http://aol.com

Re: Preventing mass index delete via DataImportHandler full-import

Posted by Chris Hostetter <ho...@fucit.org>.
: I have a small worry though. When I call the full-import functions, can 
: I configure Solr (via the XML files) to make sure there are rows to 
: index before wiping everything? What worries me is if, for some unknown 
: reason, we have an empty database, then the full-import will just wipe 
: the live index and the search will be broken.

I believe if you set clear=false when doing the full-import, DIH won't 
delete the entire index before it starts.  it probably makes the 
full-import slower (most of the adds wind up being deletes followed by 
adds) but it should prevent you from having an empty index if something 
goes wrong with your DB.
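
For reference, a small Python sketch that builds such a request URL. Note the parameter is spelled `clean`, as pointed out in the reply quoted elsewhere in this thread. The base URL and the `/dataimport` handler path are assumptions (the path depends on how the handler is registered in solrconfig.xml), and `dih_url` is just a helper name made up here.

```python
from urllib.parse import urlencode


def dih_url(base, command="full-import", clean=False, commit=True):
    """Build a DataImportHandler request URL.

    base is the Solr (or core) URL; "/dataimport" is assumed to be the
    path the handler is registered under.
    """
    params = {
        "command": command,
        "clean": str(clean).lower(),    # false: do not wipe the index first
        "commit": str(commit).lower(),  # commit once the import finishes
    }
    return f"{base}/dataimport?{urlencode(params)}"


print(dih_url("http://localhost:8983/solr"))
```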

the big catch is you now have to be responsible for managing deletes 
(using the XmlUpdateRequestHandler) yourself ... this bug looks like its 
goal is to make this easier to deal with (but it's not really clear to 
me what "deletedPkQuery" is ... it doesn't seem to be documented).

https://issues.apache.org/jira/browse/SOLR-1168
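
One way to handle those deletes by hand, sketched in Python: compare the ids in the index with the ids still in the database and build one delete message per missing id, to be POSTed to the update handler. How the two id lists are obtained is left out, and `delete_messages` is a hypothetical helper, not part of Solr.

```python
from xml.sax.saxutils import escape


def delete_messages(index_ids, db_ids):
    """Return one <delete> message for every id in the index but not the DB."""
    stale = sorted(set(index_ids) - set(db_ids))
    # One message per id keeps each payload a single well-formed XML document.
    return [f"<delete><id>{escape(str(doc_id))}</id></delete>" for doc_id in stale]
```

An empty list means nothing needs deleting; remember that a commit is still needed after sending the messages before the deletes become visible to searches.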



-Hoss