Posted to solr-user@lucene.apache.org by Koji Sekiguchi <ko...@r.email.ne.jp> on 2010/12/04 12:40:05 UTC

Re: Problem with DIH delta-import delete.

(10/11/17 20:18), Matti Oinas wrote:
> Solr does not delete documents from the index, although delta-import
> reports that it has deleted n documents. I'm using version 1.4.1.
>
> The schema looks like
>
>   <fields>
>      <field name="uuid" type="string" indexed="true" stored="true" required="true" />
>      <field name="type" type="int" indexed="true" stored="true" required="true" />
>      <field name="blog_id" type="int" indexed="true" stored="true" />
>      <field name="entry_id" type="int" indexed="false" stored="true" />
>      <field name="content" type="textgen" indexed="true" stored="true" />
>   </fields>
>   <uniqueKey>uuid</uniqueKey>
>
>
> Relevant fields from database tables:
>
> TABLE: blogs and entries both have
>
>    Field: id
>     Type: int(11)
>     Null: NO
>      Key: PRI
> Default: NULL
>    Extra: auto_increment
> ------------------------------------
>    Field: modified
>     Type: datetime
>     Null: YES
>      Key:
> Default: NULL
>    Extra:
> ------------------------------------
>    Field: status
>     Type: tinyint(1) unsigned
>     Null: YES
>      Key:
> Default: NULL
>    Extra:
>
>
> <?xml version="1.0" encoding="UTF-8" ?>
> <dataConfig>
> 	<dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver".../>
> 	<document>
> 		<entity name="blog"
> 				pk="id"
> 				query="SELECT id,description,1 as type FROM blogs WHERE status=2"
> 				deltaImportQuery="SELECT id,description,1 as type FROM blogs WHERE status=2 AND id='${dataimporter.delta.id}'"
> 				deltaQuery="SELECT id FROM blogs WHERE '${dataimporter.last_index_time}' &lt; modified AND status=2"
> 				deletedPkQuery="SELECT id FROM blogs WHERE '${dataimporter.last_index_time}' &lt;= modified AND status=3"
> 				transformer="TemplateTransformer">
> 			<field column="uuid" name="uuid" template="blog-${blog.id}" />
> 			<field column="id" name="blog_id" />
> 			<field column="description" name="content" />
> 			<field column="type" name="type" />
> 		</entity>
> 		<entity name="entry"
> 				pk="id"
> 				query="SELECT f.id as id,f.content,f.blog_id,2 as type FROM entries f,blogs b WHERE f.blog_id=b.id AND b.status=2"
> 				deltaImportQuery="SELECT f.id as id,f.content,f.blog_id,2 as type FROM entries f,blogs b WHERE f.blog_id=b.id AND f.id='${dataimporter.delta.id}'"
> 				deltaQuery="SELECT f.id as id FROM entries f JOIN blogs b ON b.id=f.blog_id WHERE '${dataimporter.last_index_time}' &lt; b.modified AND b.status=2"
> 				deletedPkQuery="SELECT f.id as id FROM entries f JOIN blogs b ON b.id=f.blog_id WHERE b.status!=2 AND '${dataimporter.last_index_time}' &lt; b.modified"
> 				transformer="HTMLStripTransformer,TemplateTransformer">
> 			<field column="uuid" name="uuid" template="entry-${entry.id}" />
> 			<field column="id" name="entry_id" />
> 			<field column="blog_id" name="blog_id" />
> 			<field column="content" name="content" stripHTML="true" />
> 			<field column="type" name="type" />
> 		</entity>
> 	</document>
> </dataConfig>
>
> Full import and delta import work without problems when it comes to
> adding new documents to the index, but when a blog is deleted (status
> is set to 3 in the database), the Solr report after delta import is
> something like "Indexing completed. Added/Updated: 0 documents.
> Deleted 81 documents.". The problem is that the documents are still
> found in the Solr index.
>
> 1. UPDATE blogs SET modified=NOW(),status=3 WHERE id=26;
>
> 2. delta-import =>
>
> <str name="">
> Indexing completed. Added/Updated: 0 documents. Deleted 81 documents.
> </str>
> <str name="Committed">2010-11-17 13:00:50</str>
> <str name="Optimized">2010-11-17 13:00:50</str>
>
> So Solr says it has deleted the documents and that the index was also
> optimized and committed after the operation.
>
> 3. Search; blog_id:26 still returns 1 document with type 1 (blog) and
> 80 documents with type 2 (entry).
>

Hi Matti,

Can you see something like the following "Completed DeletedRowKey for Entity"
and then "Deleting document: ID-1" in your solr log?

(sample messages from my Solr log)
Dec 4, 2010 8:25:40 PM org.apache.solr.handler.dataimport.DocBuilder collectDelta
INFO: Completed DeletedRowKey for Entity: product rows obtained : 2
   :
Dec 4, 2010 8:25:40 PM org.apache.solr.handler.dataimport.DocBuilder deleteAll
INFO: Deleting stale documents
Dec 4, 2010 8:25:40 PM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: OVEN-2
   :

If you cannot find these messages, I think some setting is incorrect
(but I couldn't find anything wrong in your data-config.xml...).

Koji
-- 
http://www.rondhuit.com/en/

Re: Problem with DIH delta-import delete.

Posted by Matti Oinas <ma...@gmail.com>.
The problem was an incorrect pk definition in data-config.xml:

<entity name="blog"
        pk="id"
        .......
    <field column="uuid" name="uuid" template="blog-${blog.id}" />
    <field column="id" name="blog_id" />

The pk attribute needs to name the same field as the Solr uniqueKey, so
in my case changing the pk value from id to uuid solved the problem.
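
For anyone finding this later, the change described above would presumably look like the sketch below for the blog entity. Everything except the pk value is copied from the config in the original post; the entry entity would need the same change. It is an assumption, not something confirmed in this thread, that your DIH version runs the TemplateTransformer on the rows returned by deletedPkQuery, so it is worth checking that the "Deleting document:" lines in the log now show the uuid form of the key rather than the bare numeric id.

```xml
<!-- Sketch only: the blog entity from the original post with pk changed from id to uuid. -->
<entity name="blog"
        pk="uuid"
        query="SELECT id,description,1 as type FROM blogs WHERE status=2"
        deltaImportQuery="SELECT id,description,1 as type FROM blogs WHERE status=2 AND id='${dataimporter.delta.id}'"
        deltaQuery="SELECT id FROM blogs WHERE '${dataimporter.last_index_time}' &lt; modified AND status=2"
        deletedPkQuery="SELECT id FROM blogs WHERE '${dataimporter.last_index_time}' &lt;= modified AND status=3"
        transformer="TemplateTransformer">
    <!-- uuid is built from the row's id, so it matches the schema's uniqueKey -->
    <field column="uuid" name="uuid" template="blog-${blog.id}" />
    <field column="id" name="blog_id" />
    <field column="description" name="content" />
    <field column="type" name="type" />
</entity>
```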


2010/12/7 Matti Oinas <ma...@gmail.com>:
> Thanks Koji.
>
> The problem seems to be that the TemplateTransformer is not applied
> when the delete is performed.
>
> ...
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
> INFO: Completed ModifiedRowKey for Entity: entry rows obtained : 0
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
> INFO: Completed DeletedRowKey for Entity: entry rows obtained : 1223
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
> collectDelta
> INFO: Completed parentDeltaQuery for Entity: entry
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder deleteAll
> INFO: Deleting stale documents
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
> INFO: Deleting document: 787
> Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
> INFO: Deleting document: 786
> ...
>
> There are entries with ids 787 and 786 in the database, and those are
> marked as deleted. The query returns the right number of deleted
> documents and the right rows from the database, but the delete fails
> because Solr uses the plain numeric id instead of the uuid when
> deleting a document. The same happens with blogs.
>
> Matti
>
>

Re: Problem with DIH delta-import delete.

Posted by Matti Oinas <ma...@gmail.com>.
Thanks Koji.

The problem seems to be that the TemplateTransformer is not applied
when the delete is performed.

...
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed ModifiedRowKey for Entity: entry rows obtained : 0
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed DeletedRowKey for Entity: entry rows obtained : 1223
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder
collectDelta
INFO: Completed parentDeltaQuery for Entity: entry
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.DocBuilder deleteAll
INFO: Deleting stale documents
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 787
Dec 7, 2010 7:19:43 AM org.apache.solr.handler.dataimport.SolrWriter deleteDoc
INFO: Deleting document: 786
...

There are entries with ids 787 and 786 in the database, and those are
marked as deleted. The query returns the right number of deleted
documents and the right rows from the database, but the delete fails
because Solr uses the plain numeric id instead of the uuid when
deleting a document. The same happens with blogs.
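
One possible workaround, given here only as an untested sketch, is to have deletedPkQuery return the key already in the uniqueKey format so that the delete does not depend on a transformer. It assumes MySQL's CONCAT function, pk="uuid" on the entity, and that DIH deletes by the value of the pk column in each returned row; none of these was verified in this thread. Only the delete-related attributes are shown.

```xml
<!-- Fragment, not a complete entity: query, deltaQuery, deltaImportQuery,
     transformer and the <field> elements from the original config are omitted. -->
<entity name="blog"
        pk="uuid"
        deletedPkQuery="SELECT CONCAT('blog-', id) AS uuid FROM blogs WHERE '${dataimporter.last_index_time}' &lt;= modified AND status=3">
</entity>
```

The entry entity would need the same treatment with an 'entry-' prefix on f.id.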

Matti

