Posted to solr-user@lucene.apache.org by Derek Poh <dp...@globalsources.com> on 2015/09/01 03:43:11 UTC

Re: 'missing content stream' issuing expungeDeletes=true

Hi Upayavira

In fact, we are currently using optimize, but we were advised to use
expungeDeletes as it is less resource intensive.
So expungeDeletes will only remove deleted documents, and it will not merge
all index segments into one?

If we don't use optimize, the deleted documents in the index will affect 
the scores (with docFreq=2) of the matched documents which will affect 
the relevancy of the search result.

Derek
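
For reference, a minimal sketch of a request that should avoid the
'missing content stream' error from the original question. As far as I
understand, expungeDeletes is an option of a commit, so an update request
with no body is only accepted when it also asks for a commit (or an
optimize). The host and the "supplier" core come from the original command;
the expunge_url helper and the RUN switch are made up for this illustration.

```shell
#!/bin/sh
# Host and core as used in the original question; adjust for your setup.
SOLR_URL="http://127.0.0.1:8983/solr"
CORE="supplier"

# Build an update URL that asks for a commit with expungeDeletes.
# (expunge_url is a hypothetical helper, not part of Solr.)
expunge_url() {
  printf '%s/%s/update?commit=true&expungeDeletes=true\n' "$SOLR_URL" "$CORE"
}

# Dry run by default: print the command. Set RUN=1 to actually send it.
if [ "${RUN:-0}" = "1" ]; then
  curl "$(expunge_url)"
else
  echo "curl '$(expunge_url)'"
fi
```

Unlike optimize, this is not expected to merge the whole index into a single
segment. Segments with only a few deletions may be left alone, which could
be why the deleted count does not always drop to zero afterwards.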

On 9/1/2015 12:05 AM, Upayavira wrote:
> If you really must expunge deletes, use optimize. That will merge all
> index segments into one, and in the process will remove any deleted
> documents.
>
> Why do you need to expunge deleted documents anyway? It is generally
> done in the background for you, so you shouldn't need to worry about it.
>
> Upayavira
>
> On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:
>> Hi,
>>
>> The curl command below worked without error, so you can try it.
>>
>> curl 'http://localhost:8983/solr/techproducts/update?commit=true' \
>>   -H 'Content-Type: text/xml' \
>>   --data-binary '<commit waitSearcher="false" expungeDeletes="true"/>'
>>
>> However, after executing this, I could still see the same deleted count
>> on the dashboard.  Deleted Docs:6
>> I am not sure whether that means the command did not take effect, or that
>> it took effect but was not reflected in the dashboard view.
>>
>> On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <dp...@globalsources.com>
>> wrote:
>>
>>> Hi
>>>
>>> I tried doing an expungeDeletes=true with the following but got the message
>>> 'missing content stream'. What am I missing? Do I need to provide additional
>>> parameters?
>>>
>>> curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true'
>>>
>>> Thanks,
>>> Derek
>>>
>>> ----------------------
>>> CONFIDENTIALITY NOTICE
>>> This e-mail (including any attachments) may contain confidential and/or
>>> privileged information. If you are not the intended recipient or have
>>> received this e-mail in error, please inform the sender immediately and
>>> delete this e-mail (including any attachments) from your computer, and you
>>> must not use, disclose to anyone else or copy this e-mail (including any
>>> attachments), whether in whole or in part.
>>> This e-mail and any reply to it may be monitored for security, legal,
>>> regulatory compliance and/or other appropriate reasons.

Re: 'missing content stream' issuing expungeDeletes=true

Posted by Erick Erickson <er...@gmail.com>.
bq: When we found out the document has a docfreq
of 2, we did a query on the document's product id and
indeed 2 documents were returned.
We suspect 1 of them is deleted but not removed from the index.

This is totally inconsistent with how Solr works _if_ these
documents had the same value for whatever field is defined
in your schema.xml as the <uniqueKey>, usually "id".
So how did you do your query? Through Solr or looking
at things at a low level with Lucene? It should not matter
whether you re-index from scratch or not. So I suspect there's
something else going on here.

bq: Each document (or product record) is unique in the collection.

Then boosting on the unique value is probably not what you
really want to do. You have to already _know_ the values here
and you just want them at the top. And they're unique per doc.
Could you use the QueryElevationComponent? The original
intent of that component was statically defined elevation, but you can
also provide a set of IDs as HTTP parameters, see:
https://cwiki.apache.org/confluence/display/solr/The+Query+Elevation+Component
down near the bottom of that page.
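
A rough sketch of what passing IDs as HTTP parameters could look like,
using the component's elevateIds parameter. It assumes the elevation
component is attached to a request handler named "elevate" on the
"supplier" core; the handler name, the query and the product ids P1001 and
P1002 are placeholders, and elevate_url is a made-up helper.

```shell
#!/bin/sh
# Assumed setup: a handler named "elevate" that includes the elevation component.
SOLR_URL="http://127.0.0.1:8983/solr"
CORE="supplier"
HANDLER="elevate"

# Build a query URL that asks for the given ids to be placed at the top.
# $1 = already URL-encoded query, $2 = comma-separated uniqueKey values.
# (elevate_url is a hypothetical helper, not part of Solr.)
elevate_url() {
  printf '%s/%s/%s?q=%s&elevateIds=%s\n' "$SOLR_URL" "$CORE" "$HANDLER" "$1" "$2"
}

# Print the command rather than sending it; the ids are placeholders.
echo "curl '$(elevate_url 'led+lamp' 'P1001,P1002')'"
```

Because the ids are pinned to the top explicitly, their position should no
longer depend on docFreq or on deleted documents.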

Now, all that said if you're indexing only occasionally (say
once a day), then optimizing is not a bad way to go. And having
only 6M docs means it won't take all that long.

Best,
Erick

>>> On 9/1/2015 11:40 PM, Erick Erickson wrote:
>>>>
>>>> Derek:
>>>>
>>>> Why do you care? What evidence do you have that this matters
>>>> _practically_?
>>>>
>>>> If you've looked at scoring with a small number of documents, you'll see
>>>> significant differences due to deleted documents. In most cases, as you
>>>> get a larger number of documents, the difference in ranking between an
>>>> index with no deletions and an index that has deletions is usually not
>>>> noticeable.
>>>>
>>>> I'm suggesting that this is a red herring. Your specific situation may
>>>> be different of course, but since scoring is really only about ranking
>>>> docs relative to each other, unless the relative positions change enough
>>>> to be noticeable it's not a problem.
>>>>
>>>> Note that I'm saying "relative rankings", NOT "absolute score". Document
>>>> scores have no meaning outside comparisons to other docs _in the same
>>>> query_. So unless you see documents changing their position in the list
>>>> due to having deleted docs, it's not worth spending time on IMO.
>>>>
>>>> Best,
>>>> Erick
>>>>
>>>> On Tue, Sep 1, 2015 at 12:45 AM, Upayavira <uv...@odoko.co.uk> wrote:
>>>>>
>>>>> I wonder if this resolves it [1]. It has been applied to trunk, but not
>>>>> to the 5.x release branch.
>>>>>
>>>>> If you needed it in 5.x, I wonder if there's a way that particular
>>>>> choice could be made configurable.
>>>>>
>>>>> Upayavira
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/LUCENE-6711

Re: 'missing content stream' issuing expungeDeletes=true

Posted by Derek Poh <dp...@globalsources.com>.
There are a little over 6 million documents in the collection.

Each document (or product record) is unique in the collection.
When we found that the document had a docFreq of 2, we queried on the
document's product id, and indeed 2 documents were returned.
We suspected that 1 of them was deleted but not removed from the index, so
we tried optimizing. After that, only 1 document is returned when we query
again, and the document's docFreq is 1.

We checked the source data and the document is not duplicated.
It could be that the way we index (a full index every time) results in
2 copies of the same document in the index.
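
To put rough numbers on the docFreq effect, here is a small arithmetic
sketch. It assumes the classic TF-IDF formula idf = 1 + ln(numDocs /
(docFreq + 1)) and a round figure of 6,000,000 documents; the similarity
actually configured may differ, and the idf helper is only for
illustration.

```shell
#!/bin/sh
# Classic TF-IDF idf (assumed here): idf = 1 + ln(numDocs / (docFreq + 1)).
# $1 = docFreq, $2 = number of documents in the index. awk's log() is ln.
idf() {
  awk -v df="$1" -v n="$2" 'BEGIN { printf "%.3f\n", 1 + log(n / (df + 1)) }'
}

idf 1 6000000   # term seen in the live document only
idf 2 6000000   # live document plus a deleted copy that is still in a segment
```

Under those assumptions the idf drops from about 15.91 to about 15.51 when
the deleted copy is still counted, which is a small change but enough to
reorder documents whose scores are otherwise close.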

Re: 'missing content stream' issuing expungeDeletes=true

Posted by Erick Erickson <er...@gmail.com>.
How many documents are in your corpus in total? And how many do you
intend to have?

My point is that if you are testing this with a small corpus, the results
are very likely different than when you test on a reasonable corpus.
So if you expect your "real" index will contain many more docs than
what you're testing, this is likely a red herring.

But something isn't making a lot of sense here. You say you've traced it
to having a docfreq of 2 that changes to 1. But that means that the
value is unique in your entire corpus, which kind of indicates you're
trying to boost on unique values which is unusual.

If you're confident in your model though, the only way to guarantee
what you want is to optimize/expungeDeletes.

Best,
Erick
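
For completeness, a minimal sketch of the optimize variant, which can be
requested with an empty body in the same way as a commit. The host and the
"supplier" core are taken from the original question; optimize_url and the
RUN switch are made up for this illustration.

```shell
#!/bin/sh
# Host and core as used in the original question; adjust for your setup.
SOLR_URL="http://127.0.0.1:8983/solr"
CORE="supplier"

# Build an update URL that asks for an optimize (a forced merge of the index).
# (optimize_url is a hypothetical helper, not part of Solr.)
optimize_url() {
  printf '%s/%s/update?optimize=true\n' "$SOLR_URL" "$CORE"
}

# Dry run by default: print the command. Set RUN=1 to actually send it.
if [ "${RUN:-0}" = "1" ]; then
  curl "$(optimize_url)"
else
  echo "curl '$(optimize_url)'"
fi
```

An optimize rewrites the whole index, so it is best run after a full
indexing pass rather than on every update.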

Re: 'missing content stream' issuing expungeDeletes=true

Posted by Derek Poh <dp...@globalsources.com>.
Erick

Yes, we do see documents changing their position in the list due to
deleted docs.
In our search results, we apply a higher boost (bq) to a group of matched
documents so that they display in the top tier of the results.
At times 1 or 2 of these documents are not returned in the top tier; they
are relegated to the lower tier of the results. We discovered that
these documents have a lower score due to docFreq=2.
After we do an optimize, these 1-2 documents are back in the top tier
of the result order and their docFreq is 1.



On 9/1/2015 11:40 PM, Erick Erickson wrote:
> Derek:
>
> Why do you care? What evidence do you have that this matters _practically_?
>
> If you've look at scoring with a small number of documents, you'll see
> significant
> differences due to deleted documents. In most cases, as you get a larger number
> of documents the ranking of documents in an index with no deletions .vs. indexes
> that have deletions is usually not noticeable.
>
> I'm suggesting that this is a red herring. Your specific situation may
> be different
> of course, but since scoring is really only about ranking docs
> relative to each other,
> unless the relative positions change enough to be noticeable it's not a problem.
>
> Note that I'm saying "relative rankings", NOT "absolute score". Document scores
> have no meaning outside comparisons to other docs _in the same query_. So
> unless you see documents changing their position in the list due to
> having deleted
> docs, it's not worth spending time on IMO.
>
> Best,
> Erick
>
> On Tue, Sep 1, 2015 at 12:45 AM, Upayavira <uv...@odoko.co.uk> wrote:
>> I wonder if this resolves it [1]. It has been applied to trunk, but not
>> to the 5.x release branch.
>>
>> If you needed it in 5.x, I wonder if there's a way that particular
>> choice could be made configurable.
>>
>> Upayavira
>>
>> [1] https://issues.apache.org/jira/browse/LUCENE-6711
>> On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:
>>> Hi Upayavira
>>>
>>> In fact we are using optimize currently but was advised to use expunge
>>> deletes as it is less resource intensive.
>>> So expunge deletes will only remove deleted documents, it will not merge
>>> all index segments into one?
>>>
>>> If we don't use optimize, the deleted documents in the index will affect
>>> the scores (with docFreq=2) of the matched documents which will affect
>>> the relevancy of the search result.
>>>
>>> Derek
>>>
>>> On 9/1/2015 12:05 AM, Upayavira wrote:
>>>> If you really must expunge deletes, use optimize. That will merge all
>>>> index segments into one, and in the process will remove any deleted
>>>> documents.
>>>>
>>>> Why do you need to expunge deleted documents anyway? It is generally
>>>> done in the background for you, so you shouldn't need to worry about it.
>>>>
>>>> Upayavira
>>>>
>>>> On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:
>>>>> Hi,
>>>>>
>>>>> The below curl command worked without error, you can try.
>>>>>
>>>>> curl http://localhost:8983/solr/techproducts/update?commit=true -H
>>>>> "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"
>>>>> expungeDeletes="true"/>'
>>>>>
>>>>> However, after executing this, I could still see same deleted counts on
>>>>> dashboard.  Deleted Docs:6
>>>>> I am not sure whether that means,  the command did not take effect or it
>>>>> took effect but did not reflect on dashboard view.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <dp...@globalsources.com>
>>>>> wrote:
>>>>>
>>>>>> Hi
>>>>>>
>>>>>> I tried doing a expungeDeletes=true with the following but get the message
>>>>>> 'missing content stream'. What am I missing? I need to provide additional
>>>>>> parameters?
>>>>>>
>>>>>> curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
>>>>>> ';
>>>>>>
>>>>>> Thanks,
>>>>>> Derek
>>>>>>
>>>
>



Re: 'missing content stream' issuing expungeDeletes=true

Posted by Erick Erickson <er...@gmail.com>.
Derek:

Why do you care? What evidence do you have that this matters _practically_?

If you look at scoring with a small number of documents, you'll see
significant differences due to deleted documents. In most cases, as you get
a larger number of documents, the difference in ranking between an index
with no deletions and one that has deletions is usually not noticeable.

I'm suggesting that this is a red herring. Your specific situation may be
different of course, but since scoring is really only about ranking docs
relative to each other, unless the relative positions change enough to be
noticeable it's not a problem.

Note that I'm saying "relative rankings", NOT "absolute score". Document scores
have no meaning outside comparisons to other docs _in the same query_. So
unless you see documents changing their position in the list due to
deleted docs, it's not worth spending time on IMO.
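
To make the "relative rankings" point concrete, here is a minimal sketch,
not Solr's actual scoring code. It assumes the classic TF-IDF idf formula,
1 + ln(numDocs / (docFreq + 1)), and leaves out norms, coord and queryNorm.
The term names, term frequencies, docFreq values and index sizes are all
made up for illustration; the "with deletes" figures simply pretend that
deleted documents are still counted in numDocs and docFreq.

```python
import math

def classic_idf(doc_freq: int, num_docs: int) -> float:
    """Classic TF-IDF idf: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def score(term_freqs: dict, doc_freqs: dict, num_docs: int) -> float:
    """Simplified score: sum over query terms of sqrt(tf) * idf^2."""
    return sum(
        math.sqrt(tf) * classic_idf(doc_freqs[term], num_docs) ** 2
        for term, tf in term_freqs.items()
    )

# Hypothetical term frequencies for two live documents and a two-term query.
docs = {
    "doc_a": {"solr": 3, "expunge": 0},
    "doc_b": {"solr": 1, "expunge": 1},
}

def rank(doc_freqs: dict, num_docs: int):
    """Return (doc ids ordered best-first, per-document scores)."""
    scores = {d: score(tfs, doc_freqs, num_docs) for d, tfs in docs.items()}
    return sorted(scores, key=scores.get, reverse=True), scores

# Index statistics without deleted documents (assumed values).
clean_order, clean_scores = rank({"solr": 100, "expunge": 10}, num_docs=1000)

# Same index, but deleted documents still inflate the statistics (assumed values).
dirty_order, dirty_scores = rank({"solr": 120, "expunge": 15}, num_docs=1100)

print("no deletes:  ", clean_order, clean_scores)
print("with deletes:", dirty_order, dirty_scores)
```

With these particular numbers the absolute scores shift but the order of the
two documents does not. Whether that holds for a real index depends on its
own statistics, so comparing the result order before and after an optimize is
the check that matters.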

Best,
Erick

On Tue, Sep 1, 2015 at 12:45 AM, Upayavira <uv...@odoko.co.uk> wrote:
> I wonder if this resolves it [1]. It has been applied to trunk, but not
> to the 5.x release branch.
>
> If you needed it in 5.x, I wonder if there's a way that particular
> choice could be made configurable.
>
> Upayavira
>
> [1] https://issues.apache.org/jira/browse/LUCENE-6711
> On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:
>> Hi Upayavira
>>
>> In fact we are using optimize currently but was advised to use expunge
>> deletes as it is less resource intensive.
>> So expunge deletes will only remove deleted documents, it will not merge
>> all index segments into one?
>>
>> If we don't use optimize, the deleted documents in the index will affect
>> the scores (with docFreq=2) of the matched documents which will affect
>> the relevancy of the search result.
>>
>> Derek
>>
>> On 9/1/2015 12:05 AM, Upayavira wrote:
>> > If you really must expunge deletes, use optimize. That will merge all
>> > index segments into one, and in the process will remove any deleted
>> > documents.
>> >
>> > Why do you need to expunge deleted documents anyway? It is generally
>> > done in the background for you, so you shouldn't need to worry about it.
>> >
>> > Upayavira
>> >
>> > On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:
>> >> Hi,
>> >>
>> >> The below curl command worked without error, you can try.
>> >>
>> >> curl http://localhost:8983/solr/techproducts/update?commit=true -H
>> >> "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"
>> >> expungeDeletes="true"/>'
>> >>
>> >> However, after executing this, I could still see same deleted counts on
>> >> dashboard.  Deleted Docs:6
>> >> I am not sure whether that means,  the command did not take effect or it
>> >> took effect but did not reflect on dashboard view.
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <dp...@globalsources.com>
>> >> wrote:
>> >>
>> >>> Hi
>> >>>
>> >>> I tried doing a expungeDeletes=true with the following but get the message
>> >>> 'missing content stream'. What am I missing? I need to provide additional
>> >>> parameters?
>> >>>
>> >>> curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
>> >>> ';
>> >>>
>> >>> Thanks,
>> >>> Derek
>> >>>
>> >
>>
>>

Re: 'missing content stream' issuing expungeDeletes=true

Posted by Upayavira <uv...@odoko.co.uk>.
I wonder if this resolves it [1]. It has been applied to trunk, but not
to the 5.x release branch.

If you needed it in 5.x, I wonder if there's a way that particular
choice could be made configurable.

Upayavira

[1] https://issues.apache.org/jira/browse/LUCENE-6711
On Tue, Sep 1, 2015, at 02:43 AM, Derek Poh wrote:
> Hi Upayavira
> 
> In fact we are using optimize currently but was advised to use expunge 
> deletes as it is less resource intensive.
> So expunge deletes will only remove deleted documents, it will not merge 
> all index segments into one?
> 
> If we don't use optimize, the deleted documents in the index will affect 
> the scores (with docFreq=2) of the matched documents which will affect 
> the relevancy of the search result.
> 
> Derek
> 
> On 9/1/2015 12:05 AM, Upayavira wrote:
> > If you really must expunge deletes, use optimize. That will merge all
> > index segments into one, and in the process will remove any deleted
> > documents.
> >
> > Why do you need to expunge deleted documents anyway? It is generally
> > done in the background for you, so you shouldn't need to worry about it.
> >
> > Upayavira
> >
> > On Mon, Aug 31, 2015, at 06:46 AM, davidphilip cherian wrote:
> >> Hi,
> >>
> >> The below curl command worked without error, you can try.
> >>
> >> curl http://localhost:8983/solr/techproducts/update?commit=true -H
> >> "Content-Type: text/xml" --data-binary '<commit waitSearcher="false"
> >> expungeDeletes="true"/>'
> >>
> >> However, after executing this, I could still see same deleted counts on
> >> dashboard.  Deleted Docs:6
> >> I am not sure whether that means,  the command did not take effect or it
> >> took effect but did not reflect on dashboard view.
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Aug 31, 2015 at 8:51 AM, Derek Poh <dp...@globalsources.com>
> >> wrote:
> >>
> >>> Hi
> >>>
> >>> I tried doing a expungeDeletes=true with the following but get the message
> >>> 'missing content stream'. What am I missing? I need to provide additional
> >>> parameters?
> >>>
> >>> curl 'http://127.0.0.1:8983/solr/supplier/update/json?expungeDeletes=true
> >>> ';
> >>>
> >>> Thanks,
> >>> Derek
> >>>
> >
> 
> 