Posted to users@jackrabbit.apache.org by Nelson Takashi Omori <ne...@murah.com.br> on 2012/11/22 19:53:50 UTC

Rebuilding index

Hi All,

I'm using Jackrabbit 2.4.3 and my repository has approximately 110 
thousand nodes. Of these, about 10 thousand nodes have binary values 
whose content needs to be extracted with Tika and indexed in Lucene.

I decided to delete the index to make Jackrabbit create it again. The 
problem is the time that this operation takes. I waited for 3 hours 
and the repository wasn't initialized (I don't know exactly how long it 
takes to complete the repository initialization, because I stopped the 
process). With Tika's text extraction disabled, it took 5 minutes, so I 
concluded that the problem is the time Tika takes to extract text from 
the 10 thousand documents.

If the index becomes inconsistent and I have to run the rebuild, my 
client doesn't want to wait more than 3 hours to start using the 
system. So I'm planning to create a subclass of 
org.apache.jackrabbit.core.query.lucene.SearchIndex and try to modify 
how the index is re-created. To give my client fast access to 
the repository, I'll first skip the text extraction and create the 
index with the normal properties only. With this structure, I can give 
my client access to the repository and he can do many things using only 
the normal properties. Then, in the background, I'll start the text 
extraction for each document and update its Lucene document with the 
extracted value.

I have some questions about it.
1) Reading the source code, Jackrabbit uses LazyTextExtractorField 
(and other classes) to run the extraction in a separate thread. 
Doesn't that do exactly what I want? Even so, I waited 3 hours and the 
repository wasn't initialized and ready to use. Is that normal?
2) Is what I'm planning the best approach? Has anybody done 
something similar?

Thanks,

Nelson


Re: AW: Rebuilding index

Posted by Nelson Takashi Omori <ne...@murah.com.br>.
Thank you, Claus.

I'll try to configure a cluster and see how it works.

Another thing about the index rebuild: my computer's CPU usage stayed 
constantly at 16%, and disk access and memory usage were low too. So my 
computer's resources weren't fully used. I ran it in debug mode and saw 
this message many times:
"Executor is under load, will schedule 1987 remaining tasks for 50 ms later"

Digging deeper, I found that Jackrabbit creates a text extraction task 
as a low priority task. The execution of this kind of task is controlled 
by the value "maxLoadForLowPriorityTasks" in the JackrabbitThreadPool, 
which is read from the system property 
"org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks". 
If this property isn't set or its value is not between 0 and 100, 
Jackrabbit uses 75 by default. The value determines whether a low 
priority task may run, based on the number of threads that are active 
at that moment. With the default value, if more than 75% of the threads 
are in use, the task is scheduled for later.
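
To make the rule above concrete, here is a minimal sketch of the check as described in this message. It is not the Jackrabbit source: the class and method names (LowPriorityLoadSketch, parseMaxLoad, isOverloaded) are made up for illustration, and only the property name, the 0 to 100 range, the default of 75 and the "0 disables the check" behaviour come from what was observed above.

```java
public class LowPriorityLoadSketch {

    // Property name as reported in this thread.
    public static final String KEY =
        "org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks";

    public static final int DEFAULT_MAX_LOAD = 75;

    /** Falls back to 75 when the value is missing, not a number, or outside 0..100. */
    public static int parseMaxLoad(String raw) {
        if (raw == null) {
            return DEFAULT_MAX_LOAD;
        }
        try {
            int value = Integer.parseInt(raw.trim());
            return (value >= 0 && value <= 100) ? value : DEFAULT_MAX_LOAD;
        } catch (NumberFormatException e) {
            return DEFAULT_MAX_LOAD;
        }
    }

    /**
     * Returns true when a low priority task should be postponed.
     * A limit of 0 switches the check off (assumption based on the
     * behaviour described in this thread).
     */
    public static boolean isOverloaded(int activeThreads, int poolSize, int maxLoadPercent) {
        if (maxLoadPercent == 0 || poolSize <= 0) {
            return false;
        }
        // Same as (activeThreads / poolSize) * 100 > maxLoadPercent, without floating point.
        return activeThreads * 100 > poolSize * maxLoadPercent;
    }

    public static void main(String[] args) {
        int maxLoad = parseMaxLoad(System.getProperty(KEY));
        System.out.println("limit = " + maxLoad + "%");
        System.out.println("8 of 10 threads busy, postpone? " + isOverloaded(8, 10, maxLoad));
    }
}
```

With the default limit, 8 busy threads out of 10 (80%) would postpone the task, while 7 out of 10 (70%) would let it run.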

So I set the property 
"org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks" 
to "0", which makes Jackrabbit skip this check, and the process was 
faster: about 2 hours to complete the rebuild. The CPU usage fluctuated 
between 50% and 90%, memory was used up to the limit and the disk was 
accessed more steadily. It may be better to increase the allocated 
memory before you run this.

In my scenario, it makes sense to set this value to "0", because while 
the rebuild process is running, my client can't use the system, so I 
can use all the resources that I have to finish as soon as possible. 
After the rebuild, you should remove the property, so that 
Jackrabbit can control the execution of low priority tasks again.
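
As a rough sketch of the "set it for the rebuild, remove it afterwards" idea, the snippet below sets the property from Java and restores the previous state when done. The class and method names are hypothetical, and the repository start-up is only a placeholder comment. I am assuming here that the property has to be in place before Jackrabbit creates its thread pool, and I have not checked whether changing it later in the same JVM has any effect. The simplest variant is probably to pass it as a -D option on the JVM command line for the rebuild run only and leave it out for normal runs.

```java
public class RebuildPropertySketch {

    // Property name as reported in this thread.
    public static final String KEY =
        "org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks";

    /** Sets the limit to "0" and returns the previous value (null if it was not set). */
    public static String enableFullSpeed() {
        return System.setProperty(KEY, "0");
    }

    /** Puts the property back to the state it had before enableFullSpeed(). */
    public static void restore(String previous) {
        if (previous == null) {
            System.clearProperty(KEY);
        } else {
            System.setProperty(KEY, previous);
        }
    }

    public static void main(String[] args) {
        String previous = enableFullSpeed();
        try {
            // Placeholder: start the repository here and wait for the
            // index rebuild to finish.
            System.out.println(KEY + " = " + System.getProperty(KEY));
        } finally {
            restore(previous);
        }
    }
}
```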

Maybe this can help someone who has to rebuild the index as soon as 
possible and doesn't have the cluster mentioned by Claus.

On 23/11/2012 06:29, KÖLL Claus wrote:
> Hi Nelson,
>
>> 1) Reading the source code, jackrabbit is using LazyTextExtractorField (and other classes) to execute the extraction in a separate thread.
>> Doesn't it do exactly what I want? But, even so I waited 3 hours and the repository wasn't initialized and ready to use. Is it normal?
> First: yes, this is normal.
> And yes, you are right about extraction in a separate thread: this happens on the session.save() operation. If you start the repository and the index is not present, it will start to re-index.
> In that case Jackrabbit does not separate full-text indexing from "normal" node/property indexing, so startup will take a long time,
> depending on your content.
>
>> 2)  What I'm planning to do is the best approach? Did anybody make something similar?
> One way to handle such index recovery is to create a cluster. Let's assume you have 2 cluster members, where one is the primary and the other is a hot standby.
> If you have problems with the index on the primary cluster member, you could copy the index folder from the standby cluster member.
> If you like, you could also re-index the repository on the standby member while the primary is running.
>
> greets
> claus
>

AW: Rebuilding index

Posted by KÖLL Claus <C....@TIROL.GV.AT>.
Hi Nelson,

>1) Reading the source code, jackrabbit is using LazyTextExtractorField (and other classes) to execute the extraction in a separate thread. 
>Doesn't it do exactly what I want? But, even so I waited 3 hours and the repository wasn't initialized and ready to use. Is it normal?

First: yes, this is normal.
And yes, you are right about extraction in a separate thread: this happens on the session.save() operation. If you start the repository and the index is not present, it will start to re-index.
In that case Jackrabbit does not separate full-text indexing from "normal" node/property indexing, so startup will take a long time,
depending on your content.

>2)  What I'm planning to do is the best approach? Did anybody make something similar?

One way to handle such index recovery is to create a cluster. Let's assume you have 2 cluster members, where one is the primary and the other is a hot standby.
If you have problems with the index on the primary cluster member, you could copy the index folder from the standby cluster member.
If you like, you could also re-index the repository on the standby member while the primary is running.

greets
claus

Re: Rebuilding index

Posted by Nicolas Peltier <np...@adobe.com>.
Hi,

There aren't many reasons for the index to become inconsistent
(mainly an abrupt stop of the server), and there are ways to fix
inconsistencies (which take time as well). If your search is "well defined"
(i.e. you know that you are/will be searching only for certain
nodes/properties), a simpler way to go is an indexing configuration
(configuring the index for only those nodes/properties).
 
Nicolas
