Posted to solr-user@lucene.apache.org by Peter Boudreau <pe...@makeshop.jp> on 2012/03/09 09:22:14 UTC

Solr DIH and $deleteDocById

Hello everyone,

I’ve got Solr DIH up and running with no problems as far as importing data, but I’m now trying to add some functionality to our delta import to delete invalid records.

The special command $deleteDocById seems to provide what I’m looking for, and just for testing purposes until I get things working, I set up a simple transformer to delete just one document with a specific ID:

<script>
<![CDATA[ 
    function deleteBadDocs(row) {
        var uniqueID = row.get('unique_id');
        if(uniqueID == '1-devpeter-1') { 
            row.put('$deleteDocById', uniqueID); 
        }
        return row; 
    }
]]>
</script>

When I run DIH with this, sure enough, it tells me that 1 document was deleted:

Indexing completed. Added/Updated: 4755 documents. Deleted 1 documents. 

But then when I search the index, the document is still there.  I’ve been googling this for a while now and found a number of references saying that you need to commit or optimize afterwards for the deletes to take effect.  I was under the impression that DIH both commits and optimizes by default, so shouldn’t that happen automatically?  I even tried explicitly setting the commit= and optimize= flags to true, but the deleted document was still in the index when I searched.  I also tried restarting Solr, but the document was still there.

Could anyone help me understand why this document which is being reported as deleted still shows up in the index?

Also, there is one thing which I’m unclear on after reading the Solr wiki:

$deleteDocById : Delete a doc from Solr with this id. The value has to be the uniqueKey value of the document. Note that this command can only delete docs already committed to the index. 

I was starting to think that maybe $deleteDocById only prevents documents from entering the index and does not delete documents which are already in it.  But if I understand the wiki correctly, $deleteDocById should be able to delete a document which was already in the index *before* running DIH, right?

Any help would be very much appreciated.

Thanks in advance,

Peter

Re: Solr DIH and $deleteDocById

Posted by Peter Boudreau <pe...@makeshop.jp>.
Thanks for the info, James.
I failed to mention in my original message that we're on Solr 3.5 and we are 
combining the deletes with our add/updates in the same DIH.

In searching through the archives of this mailing list, I actually found a 
thread which described my problem exactly and led me to a solution:

DIH - deleting documents, high performance (delta) imports, and passing 
parameters
http://lucene.472066.n3.nabble.com/DIH-deleting-documents-high-performance-delta-imports-and-passing-parameters-td1388349.html

The point of confusion for me was that I was assuming $deleteDocById would 
delete the document already in the index and also prevent that document 
from being re-added. In fact it only deletes the document in the index; it 
does not prevent the current row from being re-added. So the document does 
get deleted, but then the same import operation just adds it back.

As suggested by the thread I referenced above, I was able to solve the 
problem nicely by just issuing both a $deleteDocById command and a $skipDoc 
command together, with $deleteDocById deleting the document already in the 
index, and $skipDoc preventing it from being re-added by the current row in 
the import.
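
A minimal sketch of that combination, reusing the function name, column, and 
test ID from the original message. The string value 'true' for $skipDoc is an 
assumption based on the DIH wiki's description of the command:

```javascript
// Sketch of a DIH ScriptTransformer function. In DIH, `row` is a
// java.util.Map, so it exposes get() and put().
function deleteBadDocs(row) {
    var uniqueID = row.get('unique_id');
    if (uniqueID == '1-devpeter-1') {
        // Delete the copy of the document that is already in the index.
        row.put('$deleteDocById', uniqueID);
        // Skip the current row so the import does not add it back.
        row.put('$skipDoc', 'true');
    }
    return row;
}
```

Rows with any other ID pass through unchanged and are indexed as usual.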

Thanks again for taking the time to respond.
It's great how helpful this list is.

- Peter

RE: Solr DIH and $deleteDocById

Posted by "Dyer, James" <Ja...@ingrambook.com>.
This (almost) sounds like https://issues.apache.org/jira/browse/SOLR-2492, which was fixed in Solr 3.4. Are you on an earlier version?

But maybe not, because you're seeing the # deleted documents increment, and prior to this bug fix (I think) the deleted counter wasn't getting incremented either.

Perhaps this is a related bug that only happens when the deletes are added via a transformer?  Try a query like this without a transformer:

select uniqueID as '$deleteDocById' from table where uniqueID = '1-devpeter-1';

Does this work?  If so, you've probably stumbled on a new bug related to SOLR-2492.

In any case, the workaround (probably) is to manually issue a commit after doing your deletes.  Or, combine your deletes with add/updates in the same DIH run and it should commit automatically as configured.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311

