Posted to dev@jackrabbit.apache.org by Michael D Robinson <md...@thoughtworks.com> on 2010/01/24 15:02:58 UTC

Problem rebuilding Lucene index on large repository

Hi,

We have a fairly large CQ 5.2.1 repository (~4m digital image assets, ~10m
BLOBs) which crashed, resulting in index corruption.  We removed the index
and are now rebuilding it.  The process is going slowly; at the current
rate of progress it will take 5 days, which we cannot afford.

Our analysis points to the following behavior as a significant limiting
factor on reindexing speed.  Jackrabbit appears to open every BLOB in the
data store, read the first 8 bytes, and close the BLOB, and it does this
three times in a row for each BLOB.  In addition, it sometimes opens the
BLOB and closes it again without reading anything.

Here is a sample:

[pid 23320] 13:37:54.522751 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1072 <0.000293>
[pid 23320] 13:37:54.523139 read(1072, "\377\330\377\340\0\20JF", 8) = 8 <0.000017>
[pid 23320] 13:37:54.523204 close(1072) = 0 <0.000016>
[pid 23320] 13:37:54.523340 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1072 <0.000307>
[pid 23320] 13:37:54.523842 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1314 <0.000303>
[pid 23320] 13:37:54.524225 read(1314, "\377\330\377\340\0\20JF", 8) = 8 <0.000014>
[pid 23320] 13:37:54.524285 close(1314) = 0 <0.000014>
[pid 23320] 13:37:54.524411 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1314 <0.000299>
[pid 23320] 13:37:54.524902 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1318 <0.000292>
[pid 23320] 13:37:54.525275 read(1318, "\377\330\377\340\0\20JF", 8) = 8 <0.000015>
[pid 23320] 13:37:54.525335 close(1318) = 0 <0.000014>
[pid 23320] 13:37:54.525463 open("/Blob02/crx-datastore/1/1b0/11b03c922843a5a628f3557d828288e32a4b51ce", O_RDONLY) = 1318 <0.000283>
[pid  3721] 13:37:54.579873 close(1318) = 0 <0.000020>
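For anyone who wants to quantify this pattern on their own trace, here is a minimal Python sketch that tallies opens, reads, bytes read, and closes per file.  It is not part of Jackrabbit or CQ; the function names (join_wrapped, summarize, format_report) are made up for illustration, and it assumes the strace output has the same shape as the sample above (records starting with "[pid N]", possibly wrapped across lines by a mail client).

```python
import re
from collections import Counter, defaultdict

# Assumption: every strace record starts with "[pid N]", as in the sample.
RECORD_START = re.compile(r"^\[pid\s+\d+\]")
OPEN_RE = re.compile(r'open\("([^"]+)",[^)]*\)\s*=\s*(\d+)')
READ_RE = re.compile(r'read\((\d+),\s*".*",\s*\d+\)\s*=\s*(\d+)')
CLOSE_RE = re.compile(r'close\((\d+)\)\s*=\s*0')


def join_wrapped(lines):
    """Rejoin records that were wrapped across several lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_START.match(line) or not records:
            records.append(line)
        else:
            # Continuation of the previous record.
            records[-1] += " " + line
    return records


def summarize(lines):
    """Return {path: Counter(open=..., read=..., bytes=..., close=...)}."""
    fd_to_path = {}  # file descriptor -> path it currently refers to
    stats = defaultdict(Counter)
    for rec in join_wrapped(lines):
        m = OPEN_RE.search(rec)
        if m:
            path, fd = m.group(1), int(m.group(2))
            fd_to_path[fd] = path
            stats[path]["open"] += 1
            continue
        m = READ_RE.search(rec)
        if m:
            path = fd_to_path.get(int(m.group(1)))
            if path is not None:
                stats[path]["read"] += 1
                stats[path]["bytes"] += int(m.group(2))
            continue
        m = CLOSE_RE.search(rec)
        if m:
            # Only count closes of descriptors we saw being opened.
            path = fd_to_path.pop(int(m.group(1)), None)
            if path is not None:
                stats[path]["close"] += 1
    return stats


def format_report(stats):
    """One summary line per file, most-opened first."""
    rows = sorted(stats.items(), key=lambda kv: -kv[1]["open"])
    return [
        f"{c['open']} opens, {c['read']} reads ({c['bytes']} bytes), "
        f"{c['close']} closes: {path}"
        for path, c in rows
    ]
```

To use it, pass any iterable of lines, for example an open file handle on a saved trace, to summarize() and print the result of format_report().  Descriptors are tracked across all pids, since threads of one process share a descriptor table; a trace that mixes unrelated processes would need per-process tracking instead.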

I am informed that this is because Lucene tries to build a full-text index
on all BLOBs, regardless of their file type.

Does anyone know whether it is possible to disable this behavior?

Thanks.

    Sincerely,
    Michael Robinson