Posted to solr-user@lucene.apache.org by realw5 <dr...@improvementdirect.com> on 2007/11/07 05:44:04 UTC

SOLR 1.2 - Duplicate Documents??

Hey all, I have a fairly odd case of duplicate documents in our Solr index
(see the XML sample below). The index holds roughly 35k documents. The only
way I've found to fix the problem is to run a delete-by-id statement, which
deletes both copies; I can then re-index that one document. This happened
previously, but that time it turned out to be a case-sensitivity issue. This
time the ids appear identical!
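
For readers following along, the delete-by-id workaround described above could be scripted roughly as follows. This is a minimal sketch that only builds the XML update messages; the helper names are hypothetical, and posting the messages to the update handler is left out because the URL depends on the installation.

```python
from xml.sax.saxutils import escape


def delete_by_id_message(doc_id):
    """Build a Solr XML <delete> message for a single uniqueKey value."""
    # escape() guards against ids that contain &, < or >
    return "<delete><id>{}</id></delete>".format(escape(doc_id))


def commit_message():
    """Build the <commit/> message that makes the delete visible."""
    return "<commit/>"


if __name__ == "__main__":
    # Deletes every document with this id; the document is re-indexed afterwards.
    print(delete_by_id_message("hr-802waclighting"))
    print(commit_message())
```

Because the delete matches on the id value, it removes both copies, which is why the document has to be re-indexed afterwards.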

Any assistance in tracking this down would be appreciated! I can provide any
other logs if necessary.

Thanks,

Dan

Sample Select Query:
<?xml version="1.0" encoding="UTF-8" ?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <arr name="categoryId">
        <int>151</int>
        <int>962</int>
        <int>1493</int>
        <int>1830</int>
      </arr>
      <arr name="finish">
        <str>N/A</str>
      </arr>
      <bool name="hasDigiCast">false</bool>
      <bool name="hasDigiVista">false</bool>
      <str name="id">hr-802waclighting</str>
      <arr name="inStock">
        <bool>false</bool>
      </arr>
      <bool name="isNew">false</bool>
      <bool name="isTopSeller">true</bool>
      <str name="manufacturer">wac lighting</str>
      <arr name="masterFinish">
        <str>not applicable</str>
      </arr>
      <date name="modifiedDate">2007-10-15T23:10:01.510Z</date>
      <bool name="onSale">false</bool>
      <int name="popularity">1683</int>
      <arr name="price">
        <float>53.91</float>
      </arr>
      <date name="productAddDate">2007-07-05T00:00:00Z</date>
      <str name="productID">HR-802</str>
      <str name="productTitle">Low Voltage Miniature Housing for Recessed Lighting Fixture</str>
      <str name="series">low voltage miniature housings</str>
      <arr name="sku">
        <str />
      </arr>
      <str name="theme" />
      <arr name="upc">
        <str />
      </arr>
    </doc>
    <doc>
      <arr name="categoryId">
        <int>151</int>
        <int>962</int>
        <int>1493</int>
        <int>1830</int>
      </arr>
      <arr name="finish">
        <str>N/A</str>
      </arr>
      <bool name="hasDigiCast">false</bool>
      <bool name="hasDigiVista">false</bool>
      <str name="id">hr-802waclighting</str>
      <arr name="inStock">
        <bool>false</bool>
      </arr>
      <bool name="isNew">false</bool>
      <bool name="isTopSeller">true</bool>
      <str name="manufacturer">wac lighting</str>
      <arr name="masterFinish">
        <str>not applicable</str>
      </arr>
      <date name="modifiedDate">2007-11-02T15:33:21.154Z</date>
      <bool name="onSale">false</bool>
      <int name="popularity">1683</int>
      <arr name="price">
        <float>53.91</float>
      </arr>
      <date name="productAddDate">2007-07-05T00:00:00Z</date>
      <str name="productID">HR-802</str>
      <str name="productTitle">Low Voltage Miniature Housing for Recessed Lighting Fixture</str>
      <str name="series">low voltage miniature housings</str>
      <arr name="sku">
        <str />
      </arr>
      <str name="theme" />
      <arr name="upc">
        <str />
      </arr>
    </doc>
  </result>
</response>
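
A response like the one above can be checked for repeated ids with a short script. The following is a minimal sketch using only the Python standard library; the function name and the trimmed-down sample are illustrative, and it assumes the uniqueKey is returned as a `<str name="id">` element, as in this thread.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def duplicate_ids(response_xml, key_field="id"):
    """Return {id: count} for every id that appears in more than one <doc>."""
    root = ET.fromstring(response_xml)
    counts = Counter()
    for doc in root.iter("doc"):
        for child in doc:
            # Only the stored uniqueKey field is of interest here.
            if child.tag == "str" and child.get("name") == key_field:
                counts[child.text] += 1
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Trimmed-down version of the response shown in this thread.
SAMPLE = """<response>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">hr-802waclighting</str>
      <date name="modifiedDate">2007-10-15T23:10:01.510Z</date>
    </doc>
    <doc>
      <str name="id">hr-802waclighting</str>
      <date name="modifiedDate">2007-11-02T15:33:21.154Z</date>
    </doc>
  </result>
</response>"""

if __name__ == "__main__":
    print(duplicate_ids(SAMPLE))
```

Note that the two documents here differ only in modifiedDate, which suggests the second one is a later re-index of the first rather than a separate record.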

Schema.xml
 <field name="id" type="string" indexed="true" stored="true"/>
 <field name="sku" type="textTight" indexed="true" stored="true" multiValued="true"/>
 <field name="upc" type="textTight" indexed="true" stored="true" multiValued="true"/>
.....
 <!-- field to use to determine and enforce document uniqueness. -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>text</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="OR"/>

-- 
View this message in context: http://www.nabble.com/SOLR-1.2---Duplicate-Documents---tf4762687.html#a13621332
Sent from the Solr - User mailing list archive at Nabble.com.


Re: SOLR 1.2 - Duplicate Documents??

Posted by realw5 <dr...@improvementdirect.com>.
I haven't made any changes to the schema since the initial full index. Do you
know if there is a way to rebuild the full index in the background, without
having to take down the current live index?

Dan



ryantxu wrote:
> 
>> 
>> Schema.xml
>>  <field name="id" type="string" indexed="true" stored="true"/>
> 
> Have you edited schema.xml since building a full index from scratch?  If 
> so, try rebuilding the index.
> 
> People often get the behavior you describe if the 'id' is a 'text' field.
> 
> ryan
> 
> 
> 



Re: SOLR 1.2 - Duplicate Documents??

Posted by cricdigs <cr...@gmail.com>.
I am having the same issue. Here are my schema.xml entries:

 <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
 <uniqueKey>id</uniqueKey>

I am using the EmbeddedSolr instructions from the current wiki page and
setting the following on my AddUpdateCommand:

      AddUpdateCommand addcmd = new AddUpdateCommand();
      addcmd.allowDups = false;
      addcmd.overwritePending = true;
      addcmd.overwriteCommitted = true;

Thanks!


ryantxu wrote:
> 
>> 
>> Schema.xml
>>  <field name="id" type="string" indexed="true" stored="true"/>
> 
> Have you edited schema.xml since building a full index from scratch?  If 
> so, try rebuilding the index.
> 
> People often get the behavior you describe if the 'id' is a 'text' field.
> 
> ryan
> 
> 
> 



Re: SOLR 1.2 - Duplicate Documents??

Posted by Ryan McKinley <ry...@gmail.com>.
> 
> Schema.xml
>  <field name="id" type="string" indexed="true" stored="true"/>

Have you edited schema.xml since building a full index from scratch?  If 
so, try rebuilding the index.

People often get the behavior you describe if the 'id' is a 'text' field.

ryan
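
A quick way to rule out the 'text'-typed id case mentioned above is to read the uniqueKey and its field type straight out of schema.xml. This is a minimal sketch using only the Python standard library; the function name and the cut-down schema are illustrative, and it assumes `<uniqueKey>` is a direct child of the `<schema>` root element.

```python
import xml.etree.ElementTree as ET


def unique_key_field_type(schema_xml):
    """Return (uniqueKey field name, its declared type), or None where missing."""
    root = ET.fromstring(schema_xml)
    key = root.findtext("uniqueKey")
    if key is None:
        return None, None
    key = key.strip()
    for field in root.iter("field"):
        if field.get("name") == key:
            return key, field.get("type")
    return key, None


# Cut-down schema built from the fragments posted in this thread.
SCHEMA = """<schema name="example" version="1.1">
  <fields>
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="sku" type="textTight" indexed="true" stored="true" multiValued="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</schema>"""

if __name__ == "__main__":
    name, field_type = unique_key_field_type(SCHEMA)
    print(name, field_type)
    if field_type != "string":
        print("warning: uniqueKey field is not a plain string type")
```

This only checks what the schema says now; if the schema was changed after the index was built, the index may still need to be rebuilt as suggested above.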


Re: SOLR 1.2 - Duplicate Documents??

Posted by Yonik Seeley <yo...@apache.org>.
On Nov 7, 2007 12:30 PM, realw5 <dr...@improvementdirect.com> wrote:
> We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
> could that be a possible source of the issue?

Yes.
Deletes are buffered and carried out in a different phase.

-Yonik
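
To illustrate why a crash between the two phases could leave two copies behind, here is a toy Python model. It is not Solr's code: the ToyIndex class, its methods, and the assumption that already-written documents survive the crash while the buffered deletes are lost are all hypothetical simplifications.

```python
class ToyIndex:
    """Toy model: adds are written at once, deletes of older copies are buffered."""

    def __init__(self):
        self.docs = []                 # (doc_id, version) pairs already written
        self.pending_deletes = set()   # ids whose older copies still need removing

    def add(self, doc_id, version):
        self.docs.append((doc_id, version))
        self.pending_deletes.add(doc_id)

    def commit(self):
        # Second phase: for each buffered id, keep only the newest copy.
        last_pos = {doc_id: i for i, (doc_id, _) in enumerate(self.docs)}
        self.docs = [
            doc for i, doc in enumerate(self.docs)
            if doc[0] not in self.pending_deletes or last_pos[doc[0]] == i
        ]
        self.pending_deletes.clear()

    def crash(self):
        # Assumption: the in-memory delete buffer is lost, written docs remain.
        self.pending_deletes.clear()

    def count(self, doc_id):
        return sum(1 for d, _ in self.docs if d == doc_id)


def run(crash_before_commit):
    index = ToyIndex()
    index.add("hr-802waclighting", "2007-10-15")
    index.commit()
    index.add("hr-802waclighting", "2007-11-02")
    if crash_before_commit:
        index.crash()
    index.commit()
    return index.count("hr-802waclighting")


if __name__ == "__main__":
    print("copies without crash:", run(False))
    print("copies with crash:", run(True))
```

In this model the later commit no longer knows that an older copy was waiting to be deleted, so both copies stay in the index until they are deleted by id.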

Re: SOLR 1.2 - Duplicate Documents??

Posted by realw5 <dr...@improvementdirect.com>.
I'm currently indexing all documents using the update XML. I have always used
the following when posting the documents to Solr:

<add overwriteCommitted="true" overwritePending="true">....</add>

I've never had the allowDups flag set to true... I'm assuming it is false by
default?

We did have Tomcat crash once (JVM OutOfMem) during an indexing process,
could that be a possible source of the issue?
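
One way to take the default out of the picture is to spell out all three attributes on every `<add>`. The sketch below only builds the message text in Python; the function name is hypothetical, and tying the two overwrite flags to the inverse of allowDups is a choice made for this illustration, not a statement about Solr's defaults.

```python
from xml.sax.saxutils import escape, quoteattr


def add_message(fields, allow_dups=False):
    """Build an <add> message with the duplicate-handling attributes made explicit.

    fields is a list of (name, value) pairs for a single document.
    """
    dups = "true" if allow_dups else "false"
    # Illustration only: overwrite existing copies whenever duplicates are not allowed.
    overwrite = "false" if allow_dups else "true"
    body = "".join(
        "<field name={}>{}</field>".format(quoteattr(name), escape(str(value)))
        for name, value in fields
    )
    return (
        '<add allowDups="{}" overwritePending="{}" overwriteCommitted="{}">'
        "<doc>{}</doc></add>"
    ).format(dups, overwrite, overwrite, body)


if __name__ == "__main__":
    print(add_message([("id", "hr-802waclighting"), ("productID", "HR-802")]))
```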

Dan
 


hossman wrote:
> 
> : Hey all, I have a fairly odd case of duplicate documents in our solr index
> : (See attached xml sample). The index is roughly 35k in documents. The only
> 
> How did you index those documents?  
> 
> Any chance you inadvertently set the "allowDups=true" attribute when
> sending them to Solr (possibly because of an option whose meaning you
> didn't fully understand in solrj or solr-ruby etc...)?
>
> -Hoss
> 
> 
> 



Re: SOLR 1.2 - Duplicate Documents??

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Nov 7, 2007, at 12:10 PM, Chris Hostetter wrote:
> : Hey all, I have a fairly odd case of duplicate documents in our solr index
> : (See attached xml sample). The index is roughly 35k in documents. The only
>
> How did you index those documents?
>
> Any chance you inadvertently set the "allowDups=true" attribute when
> sending them to Solr (possibly because of an option whose meaning you
> didn't fully understand in solrj or solr-ruby etc...)?

Just an FYI, solr-ruby at least doesn't offer allowDups as an option. :)

	Erik


Re: SOLR 1.2 - Duplicate Documents??

Posted by Chris Hostetter <ho...@fucit.org>.
: Hey all, I have a fairly odd case of duplicate documents in our solr index
: (See attached xml sample). The index is roughly 35k in documents. The only

How did you index those documents?

Any chance you inadvertently set the "allowDups=true" attribute when
sending them to Solr (possibly because of an option whose meaning you
didn't fully understand in solrj or solr-ruby etc...)?




-Hoss