Posted to solr-user@lucene.apache.org by Pascal Dimassimo <th...@hotmail.com> on 2010/02/19 20:22:09 UTC

Documents disappearing

Hi,

I have encountered a situation that I can't explain. We are indexing documents
that are often duplicates, so we activated deduplication like this:

<processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">true</bool>
      <str name="signatureField">signature</str>
      <str name="fields">title,text</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
</processor>

What I can't explain is that when I look at the document count in the log,
I see documents disappearing.

11:24:23 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0
14:04:24 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=4065 status=0 QTime=10
14:17:07 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=6499 status=0 QTime=42
14:25:42 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7629 status=0 QTime=1
14:47:12 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12
15:17:22 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13
15:47:31 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19
16:17:42 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13
16:38:17 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10
16:39:10 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=1
16:47:40 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=46
16:51:24 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=74
17:02:13 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=102
17:17:41 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=8

11:24 was the time at which Solr was started that day. Around 13:30, we
started indexing.
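
For anyone wanting to track the same trend, here is a minimal sketch that pulls the hits value out of each newSearcher entry and reports where it falls between two consecutive searchers. The sample lines are copied from the log excerpt above, with each entry on a single line; the function names `hit_counts` and `drops` are made up for this illustration.

```python
import re

# Five entries copied from the log excerpt (one entry per line).
LOG = """\
14:47:12 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10140 status=0 QTime=12
15:17:22 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=10861 status=0 QTime=13
15:47:31 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=9852 status=0 QTime=19
16:17:42 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=8112 status=0 QTime=13
16:38:17 INFO  - [myindex] webapp=null path=null params={event=newSearcher&q=*:*&wt=dismax} hits=7725 status=0 QTime=10
"""

# Timestamp at the start of the line, hits=<n> later on the same line.
ENTRY = re.compile(
    r"^(\d{2}:\d{2}:\d{2}) .*event=newSearcher.* hits=(\d+) ", re.MULTILINE
)


def hit_counts(log_text):
    """Return [(timestamp, hits), ...] for every newSearcher entry."""
    return [(ts, int(hits)) for ts, hits in ENTRY.findall(log_text)]


def drops(counts):
    """Return (from_ts, to_ts, delta) wherever hits fell between entries."""
    result = []
    for (ts0, h0), (ts1, h1) in zip(counts, counts[1:]):
        if h1 < h0:
            result.append((ts0, ts1, h1 - h0))
    return result


counts = hit_counts(LOG)
for start, end, delta in drops(counts):
    print(f"{start} -> {end}: {delta}")
```

This only reports what the warming query logged; it says nothing about why the count moved.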

At some point during indexing, I noticed that a batch of documents was
resent (i.e., documents with the same id field were sent to the index again).
According to the log, NO delete was sent to Solr.

I understand that if I send duplicates (either documents with the same id or
with the same signature), the document count should stay the same. But
how can we explain that it is decreasing? What are the possible causes of this
behavior?

Thanks! 
-- 
View this message in context: http://old.nabble.com/Documents-disappearing-tp27659047p27659047.html
Sent from the Solr - User mailing list archive at Nabble.com.


Re: Documents disappearing

Posted by Chris Hostetter <ho...@fucit.org>.
: A quick check did show me a couple of duplicates, but if I understand
: correctly, even if two different processes send the same document, the last
: one should update the previous one. If I send the same document 10 times, in
: the end, it should only be in my index once, no?

it should, yes ... i didn't say i could explain your problem, i'm just 
trying to speculate about things that might give us insight into figuring 
out if/where a bug exists.

the only thing i can possibly think of that would cause a situation like 
this (where the number of documents decreases w/o any deletes happening) 
is if some of the "add" commands use overwrite="false" and some use 
overwrite="true" ... in that 
situation, you might get 10 docs added with the same uniqueKey 
value using overwrite="false" and so you'll have 10 docs in your index.  
then you might index one more doc with the same uniqueKey value, but this 
time using overwrite="true" and that one document will overwrite all 10 of 
the previous documents, causing your doc count to decrease from 10 to 1.
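
The scenario described above can be sketched with a toy model. This is not Solr code: `ToyIndex` is a hypothetical in-memory stand-in that only mimics the described behaviour, where an add with overwrite disabled keeps existing documents that share the key, and an add with overwrite enabled replaces all of them.

```python
class ToyIndex:
    """Toy stand-in for an index keyed on a uniqueKey field (not Solr)."""

    def __init__(self):
        self.docs = []  # list of (unique_key, body) tuples

    def add(self, unique_key, body, overwrite=True):
        if overwrite:
            # Drop every existing doc that shares the key before adding.
            self.docs = [d for d in self.docs if d[0] != unique_key]
        self.docs.append((unique_key, body))

    def num_docs(self):
        return len(self.docs)


index = ToyIndex()

# Ten adds with the same key and overwrite disabled: all ten are kept.
for n in range(10):
    index.add("doc-1", f"version {n}", overwrite=False)
before = index.num_docs()

# One more add with overwrite enabled: it replaces all ten copies.
index.add("doc-1", "version 10", overwrite=True)
after = index.num_docs()

print(before, after)
```

No delete command is issued anywhere in this sketch, yet the count falls, which is the point of the scenario.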

But nothing in your description of how you are using Solr implies that 
you were doing this, hence my question of what exactly your indexing code 
looks like.

My best guess is that maybe the deduplication UpdateProcessors have a bug 
in them, but w/o a reproducible test case demonstrating the problem it 
will be nearly impossible to even know where (or if that's actually the 
problem at all).



-Hoss


Re: Documents disappearing

Posted by Pascal Dimassimo <th...@hotmail.com>.
Hi,

hossman wrote:
> 
> : We index using 4 processes that read from a queue of documents. Each
> : process sends one document at a time to the /update handler.
> 
> Hmmm.. then you should have a message from the LogUpdateProcessorFactory 
> for every individual "add" command that was received ... did you crunch 
> those to see if anything odd popped up (ie: duplicated IDs)?
> 
> what did the "start commit" log messages look like?
> 
> (FWIW: I have no hunches as to what caused that behavior, i'm just 
> scrounging for more data)
> 

A quick check did show me a couple of duplicates, but if I understand
correctly, even if two different processes send the same document, the last
one should update the previous one. If I send the same document 10 times, in
the end, it should only be in my index once, no?

The "start commit" message is always:
start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)


hossman wrote:
> 
> : Yes, I double-checked that no delete occurred. Since that indexing run, I
> : re-indexed the same set of documents twice and we always end up with 7725
> : documents, but it did not show the ~10000 document count that we saw
> : the first time. But the difference between the first indexing run and the
> : others was that the first time, indexing lasted a couple of hours because
> : the documents were not always accessible in our document queue. The others
> 
> Hmmm... what exactly does your indexing code do when the documents aren't 
> available?  ... and what happens if you forcibly commit in the middle of 
> reindexing (to see some of those counts again)?
> 

If no document is available, the threads sleep. If a commit is sent
manually during re-indexing, it just commits what has been sent to the
index so far.

I will redo the test with the same documents and under the same conditions as
in our first indexing run to see whether the counts are the same again.

Again, thanks a lot for your help.




Re: Documents disappearing

Posted by Chris Hostetter <ho...@fucit.org>.
 
: We index using 4 processes that read from a queue of documents. Each process
: sends one document at a time to the /update handler.

Hmmm.. then you should have a message from the LogUpdateProcessorFactory 
for every individual "add" command that was received ... did you crunch 
those to see if anything odd popped up (ie: duplicated IDs)?
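
As a rough sketch of that kind of crunching: once the ids of the added documents have been extracted from the update log (the extraction step is left out here because it depends on the exact log format), counting repeats takes a few lines. The function name `duplicated_ids` and the sample ids are hypothetical.

```python
from collections import Counter


def duplicated_ids(added_ids):
    """Return {id: times_added} for every id that was added more than once."""
    counts = Counter(added_ids)
    return {doc_id: n for doc_id, n in counts.items() if n > 1}


# Hypothetical ids, standing in for whatever was extracted from the log.
sample = ["a1", "b2", "a1", "c3", "b2", "a1"]
print(duplicated_ids(sample))
```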

what did the "start commit" log messages look like?

(FWIW: I have no hunches as to what caused that behavior, i'm just 
scrounging for more data)

: Yes, I double-checked that no delete occurred. Since that indexing run, I
: re-indexed the same set of documents twice and we always end up with 7725
: documents, but it did not show the ~10000 document count that we saw the
: first time. But the difference between the first indexing run and the others
: was that the first time, indexing lasted a couple of hours because the
: documents were not always accessible in our document queue. The others

Hmmm... what exactly does your indexing code do when the documents aren't 
available?  ... and what happens if you forcibly commit in the middle of 
reindexing (to see some of those counts again)?

: About the newSearcher warming query, it is a typo in the config. It should
: have been 'qt'. Thanks for this one!

Even if you change wt to qt that won't make the query make sense (q=*:* 
isn't a very useful query string when using qt=dismax)


-Hoss


Re: Documents disappearing

Posted by Pascal Dimassimo <th...@hotmail.com>.
Hoss,

Thanks for your answers. You are absolutely right, I should have provided
you with more details. 

We index using 4 processes that read from a queue of documents. Each process
sends one document at a time to the /update handler.

Yes, I double-checked that no delete occurred. Since that indexing run, I
re-indexed the same set of documents twice and we always end up with 7725
documents, but it did not show the ~10000 document count that we saw the
first time. But the difference between the first indexing run and the others
was that the first time, indexing lasted a couple of hours because the
documents were not always accessible in our document queue. The other
times, the documents were all available, so it took around 20 minutes to
re-index all documents. So there was no time for an auto-commit to happen
during the other runs, and the log never shows the newSearcher warming
query that I use as a document count. 

About the newSearcher warming query, it is a typo in the config. It should
have been 'qt'. Thanks for this one!

In my schema.xml, I have defined the id and signature fields like this:
<field name="id" type="string" indexed="true" stored="true" required="true"/>
<field name="signature" type="string" indexed="true" stored="true"/>
...
<uniqueKey>id</uniqueKey>
<defaultSearchField>fulltext</defaultSearchField>


And here is our solrconfig.xml:
<?xml version="1.0" encoding="UTF-8" ?>

<config>
 
<abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>10</mergeFactor>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>

  <mainIndex>
    <useCompoundFile>false</useCompoundFile>
    <ramBufferSizeMB>32</ramBufferSizeMB>
    <mergeFactor>10</mergeFactor>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <unlockOnStartup>false</unlockOnStartup>
  </mainIndex>

  <updateHandler class="solr.DirectUpdateHandler2">
  	<!-- Perform a <commit/> automatically under certain conditions:
         maxDocs - number of updates since last commit is greater than this
         maxTime - oldest uncommited update (in ms) is this long ago
    -->
  	<autoCommit>
		<maxDocs>10000</maxDocs>
		<maxTime>1800000</maxTime>
	</autoCommit>
  </updateHandler>


  <query>
    <maxBooleanClauses>1024</maxBooleanClauses>

    <filterCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="4096"
      autowarmCount="1024"/>

    <queryResultCache
      class="solr.LRUCache"
      size="16384"
      initialSize="4096"
      autowarmCount="128"/>

    <documentCache
      class="solr.FastLRUCache"
      size="1048576"
      initialSize="512"
      autowarmCount="0"/>

    <enableLazyFieldLoading>true</enableLazyFieldLoading>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
    <HashDocSet maxSize="3000" loadFactor="0.75"/>

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
			<str name="q">*:*</str>
			<str name="sort">original_date desc</str>
		</lst>
		<lst>
			<str name="q">*:*</str>
			<str name="wt">dismax</str>
		</lst>
		<lst>
			<str name="q">*:*</str>
			<str name="facet">true</str>			
			<str name="facet.field">source</str>
			<str name="facet.field">author</str>
			<str name="facet.field">type</str>
			<str name="facet.field">site</str>
		</lst>
      </arr>
    </listener>

    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
			<str name="q">*:*</str>
			<str name="sort">original_date desc</str>
		</lst>
		<lst>
			<str name="q">*:*</str>
			<str name="wt">dismax</str>
		</lst>
		<lst>
			<str name="q">*:*</str>
			<str name="facet">true</str>			
			<str name="facet.field">source</str>
			<str name="facet.field">author</str>
			<str name="facet.field">type</str>
			<str name="facet.field">site</str>
		</lst>
      </arr>
    </listener>

    <useColdSearcher>false</useColdSearcher>
    <maxWarmingSearchers>2</maxWarmingSearchers>
  </query>

  <requestDispatcher handleSelect="true" >
    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
    <httpCaching lastModifiedFrom="openTime" etagSeed="Solr">
    </httpCaching>
  </requestDispatcher>
      
  <requestHandler name="standard" class="solr.SearchHandler" default="true">
    <!-- default values for query parameters -->
     <lst name="defaults">
        <str name="echoParams">explicit</str>
        <str name="spellcheck.extendedResults">true</str>
        <str name="spellcheck.count">5</str>
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.onlyMorePopular">true</str>
     </lst>
	 
	 <arr name="last-components">
     	<str>spellcheck</str>     	
     </arr>
  </requestHandler>

<requestHandler name="dismax" class="solr.SearchHandler" > 
   <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.2</float>
    <str name="qf">
       fulltext^1.2 title^2.3 text^1.2
    </str>
    <str name="pf">
       fulltext^1.2 title^1.8 text^1.2
    </str>
    <str name="bf">recip(rord(original_date),1,10000,10000)^100</str>
    <str name="bq">original_date:[NOW-10DAY TO *]^2</str>
    <str name="fl">
id,title,text,author,original_date,source,section
    </str>
    <str name="mm">
       2&lt;100% 3&lt;-1 4&lt;-2 8&lt;60%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <!-- example highlighter config, enable per-query with hl=true -->

    <str name="hl.fl">text features name</str>
    <!-- for this field, we want no fragmenting, just highlighting -->
    <str name="f.name.hl.fragsize">0</str>
    <!-- instructs Solr to return the field itself if no query terms are
         found
-->
    <str name="f.name.hl.alternateField">name</str>
    <str name="f.text.hl.fragmenter">regex</str> <!-- defined below -->
    
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.onlyMorePopular">true</str>
   </lst>
   
   <arr name="last-components">
    <str>spellcheck</str>
    <str>facetcleaner</str>
    <str>docreader</str>
    <str>queryelevation</str>
    <str>didyoumean</str>
    <str>likethis</str>
   </arr>
 </requestHandler>
  
  <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">spellchecker</str>
      <str name="spellcheckIndexDir">./spellchecker1</str>
    </lst>
  </searchComponent>
  
  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
  <requestHandler name="/analysis" class="solr.AnalysisRequestHandler" startup="lazy"/>
  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
  
  <requestHandler name="/admin/ping" class="PingRequestHandler">
    <lst name="defaults">
      <str name="qt">standard</str>
      <str name="q">solrpingquery</str>
      <str name="echoParams">all</str>
    </lst>
  </requestHandler>
    
  <requestHandler name="/debug/dump" class="solr.DumpRequestHandler" startup="lazy">
    <lst name="defaults">
     <str name="echoParams">explicit</str> <!-- for all params (including the default etc) use: 'all' -->
     <str name="echoHandler">true</str>
    </lst>
  </requestHandler>
  
  <requestHandler name="/mlt" class="org.apache.solr.handler.MoreLikeThisHandler" />
  
  <highlighting>
   <fragmenter name="gap" class="org.apache.solr.highlight.GapFragmenter" default="true">
    <lst name="defaults">
     <int name="hl.fragsize">100</int>
    </lst>
   </fragmenter>
   <fragmenter name="regex" class="org.apache.solr.highlight.RegexFragmenter">
    <lst name="defaults">
      <!-- slightly smaller fragsizes work better because of slop -->
      <int name="hl.fragsize">70</int>
      <!-- allow 50% slop on fragment sizes -->
      <float name="hl.regex.slop">0.5</float> 
      <!-- a basic sentence pattern -->
      <str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
    </lst>
   </fragmenter>
   <formatter name="html" class="org.apache.solr.highlight.HtmlFormatter" default="true">
    <lst name="defaults">
     <str name="hl.simple.pre"><![CDATA[<em>]]></str>
     <str name="hl.simple.post"><![CDATA[</em>]]></str>
    </lst>
   </formatter>
  </highlighting>

  <queryResponseWriter name="xslt" class="org.apache.solr.request.XSLTResponseWriter">
    <int name="xsltCacheLifetimeSeconds">5</int>
  </queryResponseWriter> 
     
  <admin>
    <defaultQuery>solr</defaultQuery>
  </admin>
  
  <requestHandler name="/replication" class="solr.ReplicationHandler" >
	 <lst name="master">
	    <str name="enable">${enable.master:false}</str>
	    <str name="replicateAfter">startup</str> 
	    <str name="replicateAfter">commit</str>
	    <str name="replicateAfter">optimize</str>
	 </lst>
	 <lst name="slave">
	    <str name="enable">${enable.slave:false}</str> 
	    <str name="masterUrl">${slave.master.url}</str>
	    <str name="pollInterval">${slave.poll.interval}</str>
	 </lst>
  </requestHandler>
  
  <updateRequestProcessorChain>
    <processor class="org.apache.solr.update.processor.SignatureUpdateProcessorFactory">
      <bool name="enabled">true</bool>
      <bool name="overwriteDupes">false</bool>
      <str name="signatureField">signature</str>
      <str name="fields">title,text</str>
      <str name="signatureClass">org.apache.solr.update.processor.Lookup3Signature</str>
    </processor>    
    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
  </updateRequestProcessorChain>

</config>

Again, thanks for your help!




Re: Documents disappearing

Posted by Chris Hostetter <ho...@fucit.org>.
: I have encountered a situation that I can't explain. We are indexing documents
: that are often duplicates, so we activated deduplication like this:

FWIW: w/o providing us more info about what your schema looks like, and 
how you are indexing documents, all we can do is speculate about some of 
the possible causes of your problems -- for all we know you don't have 
your uniqueKey configured properly, or have something in DIH configured to 
do deletes on delta imports, etc...  We need all the facts to make 
informed suggestions.

: What I can't explain is that when I look at the documents count in the log,
: I see documents disappearing.
: 
: 11:24:23 INFO  - [myindex] webapp=null path=null
: params={event=newSearcher&q=*:*&wt=dismax} hits=0 status=0 QTime=0

1) it looks like you only included the "newSearcher" related warming query 
log messages in your email ... i assume you double checked that there were 
no "delete" messages logged by the LogUpdateProcessor ?

2) that's a fairly nonsensical warming query ... do you really have a 
queryResponseWriter registered with the name "dismax"? (it's typically used 
as either a RequestHandler (qt) or QParser (defType)) ... w/o knowing what 
your default requestHandler declaration looks like, it's totally possible 
that the number you are seeing has nothing to do with the total number of 
docs in your index, and instead just indicates how many docs match the 
literal string "*:*" in your default search field (or some set of query 
fields if you are using dismax as the default QParser), which can 
certainly change as you update existing documents.

As i said: full configs would make it a lot easier to help clear up what 
you are seeing.



-Hoss


Re: Documents disappearing

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Pascal,

Look at that difference between numDocs and maxDoc.  That delta represents deleted docs.  Maybe there is something deleting your docs after all!
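
To put a number on that delta, here is a small sketch using the two values from the LukeRequestHandler output quoted in this thread (numDocs=7725, maxDoc=28099). The helper name `deleted_docs` is made up for the example.

```python
def deleted_docs(num_docs, max_doc):
    """Return how many documents are marked deleted but not yet merged away."""
    if max_doc < num_docs:
        raise ValueError("maxDoc cannot be smaller than numDocs")
    return max_doc - num_docs


num_docs = 7725   # numDocs from the Luke output in this thread
max_doc = 28099   # maxDoc from the same output

print(deleted_docs(num_docs, max_doc))  # 20374
```

Keep in mind that Lucene implements an overwrite as a delete followed by an add, so replaced copies of a document count toward this delta as well. A large delta is therefore consistent with explicit deletes, but does not prove them.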

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



----- Original Message ----
> From: Pascal Dimassimo <th...@hotmail.com>
> To: solr-user@lucene.apache.org
> Sent: Fri, February 19, 2010 3:50:26 PM
> Subject: RE: Documents disappearing
> 
> 
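> Using LukeRequestHandler, I see:
> 
> <int name="numDocs">7725</int>
> <int name="maxDoc">28099</int>
> <int name="numTerms">758826</int>
> <long name="version">1266355690710</long>
> <bool name="optimized">false</bool>
> <bool name="current">true</bool>
> <bool name="hasDeletions">true</bool>
> <str name="directory">
> org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index
> </str>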
> 
> I will copy the index to my local machine so I can open it with luke. Should
> I look for something specific?
> 
> Thanks!


RE: Documents disappearing

Posted by Pascal Dimassimo <th...@hotmail.com>.
Using LukeRequestHandler, I see:

<int name="numDocs">7725</int>
<int name="maxDoc">28099</int>
<int name="numTerms">758826</int>
<long name="version">1266355690710</long>
<bool name="optimized">false</bool>
<bool name="current">true</bool>
<bool name="hasDeletions">true</bool>
<str name="directory">
org.apache.lucene.store.NIOFSDirectory:org.apache.lucene.store.NIOFSDirectory@/opt/solr/myindex/data/index
</str>

I will copy the index to my local machine so I can open it with luke. Should
I look for something specific?

Thanks!


ANKITBHATNAGAR wrote:
> 
> Try inspecting your index with luke
> 
> 
> Ankit
> 


RE: Documents disappearing

Posted by Ankit Bhatnagar <ab...@vantage.com>.
Try inspecting your index with luke


Ankit

