Posted to solr-user@lucene.apache.org by Mark Miller <ma...@gmail.com> on 2009/08/18 05:16:11 UTC

Re: SOLR - extremely strange behavior! Documents disappeared...

I'd say you have a lot of documents that have the same id.
When you add a doc with the same id, first the old one is deleted, then the
new one is added (atomically though).

The deleted docs are not removed from the index immediately though - the doc
id is just marked as deleted.

Over time though, as adding new documents trips the merge triggers and
segments get merged, the deleted docs are purged (which ones depends on
which segments have been merged).

So if you add a ton of documents over time, many with the same ids, you
would likely see this type of maxDoc/numDocs churn. maxDoc includes
deleted docs, while numDocs does not.
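
To make the churn concrete, here is a minimal Python sketch of the behaviour
described above. It is a toy model, not Lucene or Solr code: the ToyIndex
class, its docs_per_segment and merge_factor parameters, and the
merge-everything policy are all assumptions made for illustration (real merge
policies pick segments selectively).

```python
import random


class ToyIndex:
    """Toy model of a segmented index with overwrite-by-uniqueKey adds."""

    def __init__(self, docs_per_segment=1000, merge_factor=10):
        # Both parameters are hypothetical knobs for this sketch only.
        self.docs_per_segment = docs_per_segment
        self.merge_factor = merge_factor
        self.segments = [[]]  # each segment is a list of [doc_id, deleted] entries
        self.live = {}        # doc_id -> its current (non-deleted) entry

    def add(self, doc_id):
        old = self.live.get(doc_id)
        if old is not None:
            old[1] = True     # the old copy is only marked deleted, not removed
        entry = [doc_id, False]
        self.segments[-1].append(entry)
        self.live[doc_id] = entry
        # When the current segment is full, maybe merge, then start a new one.
        if len(self.segments[-1]) >= self.docs_per_segment:
            if len(self.segments) >= self.merge_factor:
                self.merge_all()
            self.segments.append([])

    def merge_all(self):
        # Merging rewrites the segments and drops entries marked deleted.
        merged = [e for seg in self.segments for e in seg if not e[1]]
        self.segments = [merged]

    @property
    def num_docs(self):
        # Live documents only.
        return len(self.live)

    @property
    def max_doc(self):
        # Every entry still stored, including those marked deleted.
        return sum(len(seg) for seg in self.segments)


rng = random.Random(0)
index = ToyIndex()
for _ in range(25_000):
    index.add(rng.randrange(5_000))  # small id space, so ids repeat often
print("before full merge:", index.num_docs, index.max_doc)
index.merge_all()                    # plays the role of an optimize
print("after full merge: ", index.num_docs, index.max_doc)
```

In this model num_docs can never exceed the number of distinct ids, while
max_doc keeps growing until a merge rewrites the segments, which is similar
to the numDocs and maxDoc pattern in the original report.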


-- 
- Mark

http://www.lucidimagination.com

On Mon, Aug 17, 2009 at 11:09 PM, Funtick <fu...@efendi.ca> wrote:

>
> After running an application that relies heavily on an MD5 hex string as the
> <uniqueKey> for SOLR v.1.4-dev-trunk:
>
> 1. After 30 hours:
> 101,000,000 documents added
>
> 2. Commit:
> numDocs = 783,714
> maxDoc = 3,975,393
>
> 3. Upload new docs to SOLR for 1 hour(!), then commit, then optimize:
> numDocs = 1,281,851
> maxDoc = 1,281,851
>
> It looks _extremely_ strange that within an hour I get such a huge increase
> with the same 'average' document set...
>
> I suspect something is going wrong with Lucene buffer flush / index merge,
> OR with SOLR unique ID handling...
>
> According to my own estimates, I should have about 10,000,000 new documents
> by now... I got 0.5 million within an hour, and 0.8 million within a day,
> from the same 'random' documents.
>
> This morning the index size was about 4 GB, then it suddenly dropped below
> 0.5 GB. Why? I haven't issued any "commit"...
>
> I am using ramBufferSizeMB=8192
>
> --
> View this message in context:
> http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017728.html
> Sent from the Solr - User mailing list archive at Nabble.com.
>
>

Re: SOLR - extremely strange behavior! Documents disappeared...

Posted by Funtick <fu...@efendi.ca>.
One more hour, and I have another 0.5 million (after commit/optimize).

Something strange is happening with the SOLR buffer flush (when there is a
single segment???)... an explicit commit prevents it...

30 hours, with index flush, commit: 783,714
+ 1 hour, commit, optimize: 1,281,851
+ 1 hour, commit, optimize: 1,786,552

Same random docs retrieved from web...
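
The figures reported so far can be sanity-checked with a few lines of Python.
This is only arithmetic on numbers quoted in this thread; the variable names
are illustrative and are not Solr fields.

```python
# Figures copied from the posts in this thread.
docs_added_30h = 101_000_000
max_doc_30h = 3_975_393
num_docs = [783_714, 1_281_851, 1_786_552]  # after 30 h, +1 h, +1 h

# Docs marked deleted but not yet merged away at the 30-hour commit.
deleted_30h = max_doc_30h - num_docs[0]

# Growth in numDocs between consecutive snapshots.
growth = [later - earlier for earlier, later in zip(num_docs, num_docs[1:])]

# Share of the adds that were still live, distinct documents at 30 hours.
live_share_30h = num_docs[0] / docs_added_30h

print(f"deleted, not yet merged: {deleted_30h:,}")
print(f"hourly growth in numDocs: {growth}")
print(f"live share of adds at 30 h: {live_share_30h:.2%}")
```

Both hourly increments come out close to 0.5 million, while fewer than 1% of
the 101,000,000 adds were live documents at the 30-hour commit.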



Funtick wrote:
> 
> 
> But how do you explain that within an hour (after commit) I got about
> 500,000 new documents, and within 30 hours (after commit) only 783,714?
> 
> Same _random_enough_ documents...
> 
> BTW, the SOLR console was showing only a few hundred "deletesById", although
> I don't use any deleteById explicitly; only "update" with "allowOverwrite"
> and "uniqueId".
> 

-- 
View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017967.html


Re: SOLR - extremely strange behavior! Documents disappeared...

Posted by Funtick <fu...@efendi.ca>.

But how do you explain that within an hour (after commit) I got about
500,000 new documents, and within 30 hours (after commit) only 1,300,000?

Same _random_enough_ documents...

BTW, the SOLR console was showing only a few hundred "deletesById", although
I don't use any deleteById explicitly; only "update" with "allowOverwrite" and
"uniqueId".





-- 
View this message in context: http://www.nabble.com/SOLR-%3CuniqueKey%3E---extremely-strange-behavior%21-Documents-disappeared...-tp25017728p25017826.html