Posted to solr-user@lucene.apache.org by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com> on 2009/08/05 08:51:39 UTC

Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Did you explore the deletedPkQuery?

On Wed, Aug 5, 2009 at 11:46 AM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> Hi all,
>
> the database from which I populate the SOLR index is refreshed
> "partially". Subsets of the data are deleted and re-added for a certain
> group identifier. Is it possible to do something similar in a (delta) import
> of the DataImportHandler?
>
> Example:
> SOLR-Index:
> groupID: 1, PK: 1, refreshDate: [before last_index_time]
> groupID: 1, PK: 2, refreshDate: [before last_index_time]
> groupID: 1, PK: 3, refreshDate: [before last_index_time]
>
> Refreshed DB:
> groupID: 1, PK: 1, refreshDate: [after last_index_time]
> groupID: 1, PK: 5, refreshDate: [after last_index_time]
> groupID: 1, PK: 30, refreshDate: [after last_index_time]
> (PK 2 and 3 are no longer there. PK is unique across all groupIDs)
>
> deleteQuery="groupID:1"
> (An attribute of the entity element. The DocBuilder (1.3) reads it and
> sends it unchanged to the SOLR writer as a delete query, once, before
> the delta import.)
>
> After that, the delta import loads data with groupID=1 from the DB.
>
> Could I plug into SOLR with maybe a custom processor to achieve
> something in the direction of:
>
> deleteInput="select FIELD_VALUE from TABLE where CHANGED_DATE >
> '${dataimporter.last_index_time}' group by FIELD_VALUE"
> deleteQuery="field:${my_entity.FIELD_VALUE}"
>
> FIELD_VALUE is not the primary key, and the "deleteInput" query can
> return multiple rows.
>
>
> I am aware of SOLR-1060 and SOLR-1059 but I am not sure that those will
> help me. In those cases it looks like the delete is run per entity. I
> want the delete to run before the (delta)import, once.
> If that impression is wrong, I'll happily switch to 1.4, of course.
>
> Cheers!
> Chantal
>
>
> --
> Chantal Ackermann
>
>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Great! *bow*
Thanks,
Chantal


>  <entity name="delete_from_index" pk="GROUPID" transformer="TemplateTransformer"
>         query="select GROUPID from DEFINITION
>         where LANGUAGE='de'
>                 and CHANGED_DATE > '${dataimporter.last_index_time}'" >
>         <field column="$deleteDocByQuery"
>                 template="groupid:${delete_from_index.GROUPID}"/>
>  </entity>
> 
> this should do the trick
> 


Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Posted by Noble Paul നോബിള്‍ नोब्ळ् <no...@corp.aol.com>.
On Thu, Aug 6, 2009 at 6:41 PM, Chantal
Ackermann<ch...@btelligent.de> wrote:
> Hi again,
>
> 1.4 runs fine for me now, but I'm still struggling to find the correct delete
> query. There is little to no documentation for the new special commands,
> and I have problems guessing the correct setup from reading through the
> code. SOLR-1060 is not enough help.
>
> I've come up with a separate entity for issuing the deletes - is that how
> it should be done? How can I make sure that these deletes are issued before
> the other entities are processed? Is it enough to put it as first entity in
> the config?
this is a better solution
>
> <entity name="delete_from_index" pk="GROUPID"
>        query="select GROUPID from DEFINITION
>        where LANGUAGE='de'
>                and CHANGED_DATE > '${dataimporter.last_index_time}'">
>        <field column="$deleteDocByQuery"
>                query="groupid:${delete_from_index.GROUPID}"/>
> </entity>
>
> I am not sure about which attributes to use for the "field" element when
> using that special command. I've put the query in with attribute "query"
> which is probably not right because I just wrote it down like that. Should I
> use the "name" attribute as done for the regular fields? Do I need to add a
> $skipDoc command (like it's mentioned in SOLR-1060)?

 <entity name="delete_from_index" pk="GROUPID" transformer="TemplateTransformer"
        query="select GROUPID from DEFINITION
        where LANGUAGE='de'
                and CHANGED_DATE > '${dataimporter.last_index_time}'" >
        <field column="$deleteDocByQuery"
                template="groupid:${delete_from_index.GROUPID}"/>
 </entity>

this should do the trick
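
For context, here is a minimal sketch of how such a delete entity might sit in a complete data-config.xml. The dataSource settings, the second entity (named "definition" here), and the ID and TEXT columns are hypothetical and not taken from this thread. Placing the delete entity first assumes that root entities are processed in the order they appear in the config, which should be verified against the DIH version in use.

```xml
<dataConfig>
  <!-- Hypothetical connection settings; replace with the real driver and URL. -->
  <dataSource driver="com.example.jdbc.Driver"
              url="jdbc:example://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <!-- Listed first on the assumption that entities run in config order.
         Each returned row issues one delete-by-query for its GROUPID. -->
    <entity name="delete_from_index" pk="GROUPID" transformer="TemplateTransformer"
            query="select GROUPID from DEFINITION
                   where LANGUAGE='de'
                   and CHANGED_DATE > '${dataimporter.last_index_time}'">
      <field column="$deleteDocByQuery"
             template="groupid:${delete_from_index.GROUPID}"/>
    </entity>

    <!-- Hypothetical main entity that re-adds the refreshed rows. -->
    <entity name="definition" pk="ID"
            query="select ID, GROUPID, TEXT from DEFINITION
                   where LANGUAGE='de'
                   and CHANGED_DATE > '${dataimporter.last_index_time}'">
      <field column="ID" name="id"/>
      <field column="GROUPID" name="groupid"/>
      <field column="TEXT" name="text"/>
    </entity>
  </document>
</dataConfig>
```

One way to run this, offered as an assumption rather than something confirmed in this thread, is a full-import with clean=false, so that the query attributes are used and the existing index is not wiped first.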



>
> Thanks!
> Chantal
>
>
>
> Chantal Ackermann wrote:
>>
>> Thanks, Paul! :-)
>>
>> The wiki doesn't mark $deleteDocByQuery (and the other special commands)
>> as 1.4, as it usually does. Maybe it's worth correcting that?
>>
>> Noble Paul നോബിള്‍ नोब्ळ् wrote:
>>>
>>> OK, writing an EntityProcessor/Transformer may help here. Use the special
>>> command:
>>>
>>> http://wiki.apache.org/solr/DataImportHandler#head-5e9ebf5a2aaa1dc54464102c395ed1bf7cdb98c3
>>>
>>> $deleteDocByQuery is what you need.
>>>
>



-- 
-----------------------------------------------------
Noble Paul | Principal Engineer| AOL | http://aol.com

Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi again,

1.4 runs fine for me now, but I'm still struggling to find the correct
delete query. There is little to no documentation for the new
special commands, and I have problems guessing the correct setup from
reading through the code. SOLR-1060 is not enough help.

I've come up with a separate entity for issuing the deletes - is that
how it should be done? How can I make sure that these deletes are issued
before the other entities are processed? Is it enough to put it as first
entity in the config?

<entity name="delete_from_index" pk="GROUPID"
	query="select GROUPID from DEFINITION
	where LANGUAGE='de'
		and CHANGED_DATE > '${dataimporter.last_index_time}'">
	<field column="$deleteDocByQuery"
		query="groupid:${delete_from_index.GROUPID}"/>
</entity>

I am not sure about which attributes to use for the "field" element when 
using that special command. I've put the query in with attribute "query" 
which is probably not right because I just wrote it down like that. 
Should I use the "name" attribute as done for the regular fields? Do I 
need to add a $skipDoc command (like it's mentioned in SOLR-1060)?

Thanks!
Chantal




Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Thanks, Paul! :-)

The wiki doesn't mark $deleteDocByQuery (and the other special commands) 
as 1.4, as it usually does. Maybe it's worth correcting that?

Noble Paul നോബിള്‍ नोब्ळ् wrote:
> OK, writing an EntityProcessor/Transformer may help here. Use the special command:
> http://wiki.apache.org/solr/DataImportHandler#head-5e9ebf5a2aaa1dc54464102c395ed1bf7cdb98c3
> 
> $deleteDocByQuery is what you need.
> 

Re: DataImportHandler: Partial Delete and Update (Hacking "deleteQuery" in SOLR 1.3?)

Posted by Chantal Ackermann <ch...@btelligent.de>.
Hi Paul,

yes, I did, and I just verified it in the code. The deletedPkQuery is used
to collect all primary keys of the root entity that shall be deleted
from the index.
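
As a rough illustration of that attribute (the ITEM table, its NAME column, its DELETED flag and its LAST_MODIFIED column are hypothetical), a deletedPkQuery has to return the primary keys themselves. That only works while the deleted rows can still be found in the database, for example through a soft-delete flag:

```xml
<!-- Sketch only: table and column names are made up for illustration. -->
<entity name="item" pk="ID"
        query="select ID, NAME from ITEM"
        deltaQuery="select ID from ITEM
                    where LAST_MODIFIED > '${dataimporter.last_index_time}'"
        deletedPkQuery="select ID from ITEM
                        where DELETED = 1
                        and LAST_MODIFIED > '${dataimporter.last_index_time}'">
  <field column="ID" name="id"/>
  <field column="NAME" name="name"/>
</entity>
```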

The deletion is done on the SOLR writer by unique ID:
       writer.deleteDoc(deletedKey.get(root.pk)); //DocBuilder

       delCmd.id = id.toString(); // SOLR Writer deleteDoc()
       delCmd.fromPending = true;
       delCmd.fromCommitted = true;
       processor.processDelete(delCmd);

// RunUpdateProcessorFactory
   @Override
   public void processDelete(DeleteUpdateCommand cmd) throws IOException {
     if( cmd.id != null ) {
       updateHandler.delete(cmd); // writer.deleteDoc() uses that
     }
     else {
       updateHandler.deleteByQuery(cmd); // I would like to use that
     }
     super.processDelete(cmd);
   }
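
For comparison, the two branches correspond to the two forms of delete that Solr's XML update handler accepts. A sketch of both, as two separate update messages, using the values from the example in the original question:

```xml
<!-- Message 1: delete by unique key. This needs the PK, which is no
     longer available from the database once the row is gone. -->
<delete><id>2</id></delete>

<!-- Message 2: delete by query. This removes every document of the group,
     including those whose PKs have disappeared from the database. -->
<delete><query>groupID:1</query></delete>
```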

My problem is that the ids I have to delete are those that do not exist 
in the database anymore. So, I have no means to return them by DB query. 
That is why I would like to use a different field that a group of 
documents has in common, and that would allow me to get hold of the 
outdated documents in the index. (But I have to find out the value of 
that other field by DB query.)

Cheers,
Chantal

