Posted to solr-user@lucene.apache.org by "Avni, Itamar" <It...@verint.com> on 2011/12/29 19:55:58 UTC

Frequent Indexing of same Documents

Hi community,

Say I have lots of documents to index, each with a primary key in the index, and I index them frequently.
They are not indexed all together (as in a bulk load), but each at a different time.

1) Is there a significant difference in performance between a freshly created core ("the first time to index") and an "old" core (every document already exists in the core)?
2) When updating a document, is it updated in place, or is the old copy (according to primary key) marked as deleted and a new document inserted?
3) Will indexing the same documents over and over again increase the size of the index? (assuming the documents did not change much)
4) Sorry for the dumb questions. My boss is making me ask them :-D .

Iavni


This electronic message may contain proprietary and confidential information of Verint Systems Inc., its affiliates and/or subsidiaries.
The information is intended to be for the use of the individual(s) or
entity(ies) named above.  If you are not the intended recipient (or authorized to receive this e-mail for the intended recipient), you may not use, copy, disclose or distribute to anyone this message or any information contained in this message.  If you have received this electronic message in error, please notify us by replying to this e-mail.


Re: Frequent Indexing of same Documents

Posted by Erick Erickson <er...@gmail.com>.
See below

1) Is there a significant difference in performance between a freshly
created core ("the first time to index") and an "old" core (every
document already exists in the core)?

Not really. Documents are indexed in "segments", and a fresh segment is
usually opened after every commit (you only commit after some time,
not once per document).
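
To make the commit point concrete, here is a toy count of how many segments
get opened for a given commit interval. It is only a sketch: the function
name and the "one segment per commit" rule are simplifying assumptions, and
it ignores the background merging that Lucene does.

```python
def count_segments(num_docs, docs_per_commit):
    """Toy model: each commit with pending documents opens one new segment."""
    segments = 0
    pending = 0
    for _ in range(num_docs):
        pending += 1
        if pending == docs_per_commit:
            segments += 1  # commit flushes the pending docs as a segment
            pending = 0
    if pending:
        segments += 1  # final commit flushes whatever is left over
    return segments


print(count_segments(10_000, 1))      # commit per document: 10000
print(count_segments(10_000, 1_000))  # commit per 1,000 documents: 10
```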

2) When updating a document, is it updated in place, or is the old
copy (according to primary key) marked as deleted and a new document
inserted?

The old copy is marked as deleted and a complete new document is added
to the index. The deleted copy of the document is gradually purged
over time as segments are merged. All of them can be purged by an
optimize, but this step is often unnecessary.
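
A minimal sketch of that behaviour, in case it helps to see it. ToyIndex and
all of its methods are made up for illustration and are not Lucene or Solr
APIs; merge_all() plays the role of an optimize by rewriting everything into
one segment without the deleted copies.

```python
class ToyIndex:
    """Toy model of update = mark old copy deleted + add new copy."""

    def __init__(self):
        self.segments = []  # committed segments, each a list of entries
        self.pending = []   # entries added since the last commit
        self.live = {}      # primary key -> current (non-deleted) entry

    def add(self, key, doc):
        old = self.live.get(key)
        if old is not None:
            old["deleted"] = True  # old copy stays on disk, only flagged
        entry = {"key": key, "doc": doc, "deleted": False}
        self.pending.append(entry)
        self.live[key] = entry

    def commit(self):
        if self.pending:
            self.segments.append(self.pending)
            self.pending = []

    def merge_all(self):
        """Rewrite all segments into one, dropping deleted copies."""
        self.commit()
        merged = [e for seg in self.segments for e in seg if not e["deleted"]]
        self.segments = [merged] if merged else []

    def stored(self):
        return sum(len(seg) for seg in self.segments) + len(self.pending)

    def live_count(self):
        return len(self.live)


idx = ToyIndex()
for version in range(3):
    idx.add("doc-1", f"body v{version}")
    idx.commit()
print(idx.stored(), idx.live_count())  # 3 1
idx.merge_all()
print(idx.stored(), idx.live_count())  # 1 1
```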

3) Will indexing the same documents over and over again increase
the size of the index? (assuming the documents did not change much)

Yes. Since an update is a delete (really a "mark as deleted") followed
by an add, the index will get bigger. As above, though, the space is
reclaimed over time as segments are merged.
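
As a rough illustration of "bigger, then reclaimed", the sketch below counts
stored copies when the whole set is reindexed repeatedly. The function name
and the merge_every parameter are hypothetical, and it assumes a full merge
at fixed intervals, whereas real merges are incremental and decided by the
merge policy.

```python
def stored_after_each_round(num_docs, rounds, merge_every):
    """Stored document copies after each full reindex of num_docs documents."""
    stored = 0
    history = []
    for r in range(1, rounds + 1):
        stored += num_docs      # each reindex adds a fresh copy of every doc
        if r % merge_every == 0:
            stored = num_docs   # merge drops every copy marked as deleted
        history.append(stored)
    return history


print(stored_after_each_round(1000, 6, 3))
# [1000, 2000, 1000, 2000, 3000, 1000]
```

The live document count stays at 1000 throughout; only the number of stored
copies swings between merges.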


4) Sorry for the dumb questions. My boss is making me ask them :-D .

A decent question is "why does your boss want to know?" That is, what
is the higher-level question that's causing the worry? Re-indexing the
same document causes the index to grow. You don't care if you're
indexing a measly 100,000 documents. You might care if you're indexing
100,000,000 docs.

Best
Erick

Re: Frequent Indexing of same Documents

Posted by Gora Mohanty <go...@mimirtech.com>.
On Fri, Dec 30, 2011 at 12:25 AM, Avni, Itamar <It...@verint.com> wrote:
[...]
> This electronic message may contain proprietary and confidential information of Verint Systems Inc., its affiliates and/or subsidiaries.
> The information is intended to be for the use of the individual(s) or
> entity(ies) named above.  If you are not the intended recipient (or authorized to receive this e-mail for the intended recipient), you may not use, copy, disclose or distribute to anyone this message or any information contained in this message.  If you have received this electronic message in error, please notify us by replying to this e-mail.
>

I would reply, but am afraid to, as I am not in the list of
individual(s) nor entity(ies) [sic!] named above.

Regards,
Gora