Posted to dev@jackrabbit.apache.org by "Marcel Reutegger (JIRA)" <ji...@apache.org> on 2006/04/10 10:23:58 UTC

[jira] Created: (JCR-390) Move text extraction into a background thread

Move text extraction into a background thread
---------------------------------------------

         Key: JCR-390
         URL: http://issues.apache.org/jira/browse/JCR-390
     Project: Jackrabbit
        Type: Improvement

  Components: indexing  
    Versions: 1.0    
 Environment: all
    Reporter: Marcel Reutegger
 Assigned to: Marcel Reutegger 
    Priority: Minor


Even though text extraction is not done directly on save(), most of the extraction work is later done by a client thread. There is a mechanism in place that commits the deferred work in a background thread, but that thread is only triggered by a timer and does not continuously write back pending index changes. For regular index changes this is done on purpose and should not be changed. Text extraction, however, should be moved completely into a background thread, because indexing a large document often takes a considerable amount of time.

Outline of a possible solution:
- all text filtering tasks are put into a work queue
- the work queue is processed by a background thread
- basic indexing of nt:resource without text filtering takes place
- the background thread updates the index when text filtering has completed for an nt:resource
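
The outline above could look roughly like the following. This is only a minimal sketch in present-day Java, not Jackrabbit code: the class BackgroundExtractionSketch, its fulltext map (a stand-in for the real index), and the extractor function are all hypothetical names made up for illustration.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical sketch; none of these names are Jackrabbit classes. */
class BackgroundExtractionSketch {

    /** Stand-in for the index: node id -> fulltext ("" until extraction is done). */
    final Map<String, String> fulltext = new ConcurrentHashMap<>();

    /** Work queue holding the pending text filtering tasks. */
    private final BlockingQueue<Runnable> workQueue = new LinkedBlockingQueue<>();

    /** Background thread that processes the work queue. */
    private final Thread worker;

    BackgroundExtractionSketch() {
        worker = new Thread(() -> {
            try {
                while (true) {
                    workQueue.take().run(); // blocks until a task is available
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // interrupted by shutdown()
            }
        }, "text-extraction");
        worker.setDaemon(true);
        worker.start();
    }

    /**
     * Indexes the node without fulltext on the calling thread and queues the
     * text filtering. The returned future completes once the entry is updated.
     */
    CompletableFuture<Void> index(String nodeId, byte[] data,
                                  Function<byte[], String> extractor) {
        fulltext.put(nodeId, ""); // basic indexing, no text filtering yet
        CompletableFuture<Void> done = new CompletableFuture<>();
        workQueue.add(() -> {
            try {
                fulltext.put(nodeId, extractor.apply(data)); // update the index
                done.complete(null);
            } catch (RuntimeException e) {
                // A failing extractor must not kill the worker thread;
                // the basic entry stays in place.
                done.completeExceptionally(e);
            }
        });
        return done;
    }

    void shutdown() {
        worker.interrupt();
    }
}
```

A caller that needs the fulltext for a particular node can wait on the returned future; everyone else returns from index() as soon as the basic entry is written.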

There should be a configuration parameter that allows text filtering to be executed without the background thread. This makes it possible to keep the existing behaviour of Jackrabbit, where the fulltext index is always up-to-date and can be used.
With the background process this is no longer the case.
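
To illustrate the trade-off behind such a parameter, here is a small sketch, again in present-day Java and with made-up names. The class ExtractionModeSketch and its backgroundExtraction flag are assumptions for illustration only and do not correspond to an actual Jackrabbit setting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.Supplier;

/** Hypothetical sketch of a switch between synchronous and background extraction. */
class ExtractionModeSketch {

    /** Stand-in for the index: node id -> fulltext ("" until extraction is done). */
    final Map<String, String> fulltext = new ConcurrentHashMap<>();

    private final ExecutorService background; // null in synchronous mode
    private final Executor executor;

    ExtractionModeSketch(boolean backgroundExtraction) {
        if (backgroundExtraction) {
            background = Executors.newSingleThreadExecutor(r -> {
                Thread t = new Thread(r, "text-extraction");
                t.setDaemon(true);
                return t;
            });
            executor = background;
        } else {
            background = null;
            executor = Runnable::run; // runs the task on the calling thread
        }
    }

    void index(String nodeId, Supplier<String> extraction) {
        fulltext.put(nodeId, ""); // basic entry without fulltext
        // Synchronous mode: the fulltext is in place when index() returns.
        // Background mode: it is filled in later by the extraction thread.
        executor.execute(() -> fulltext.put(nodeId, extraction.get()));
    }

    void shutdown() {
        if (background != null) {
            background.shutdown();
        }
    }
}
```

With the flag off, index() returns only after the fulltext is stored, which matches the existing behaviour. With the flag on, a query issued right after index() may not find the extracted text yet.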

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Updated: (JCR-390) Move text extraction into a background thread

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/JCR-390?page=all ]

Jukka Zitting updated JCR-390:
------------------------------

    Version: 1.0.1

> Move text extraction into a background thread
> ---------------------------------------------
>
>          Key: JCR-390
>          URL: http://issues.apache.org/jira/browse/JCR-390
>      Project: Jackrabbit
>         Type: Improvement

>   Components: indexing
>     Versions: 1.0, 1.0.1
>  Environment: all
>     Reporter: Marcel Reutegger
>     Assignee: Marcel Reutegger
>     Priority: Minor



[jira] Resolved: (JCR-390) Move text extraction into a background thread

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger resolved JCR-390.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 1.3

Implemented as described. See the sample repository.xml in src/main/config for details on how to configure background indexing.

For now this feature is disabled by default because it slightly changes the indexing behaviour compared to the previous implementation, which always extracted text using the current thread.

SVN revision: 497067

> Move text extraction into a background thread
> ---------------------------------------------
>
>                 Key: JCR-390
>                 URL: https://issues.apache.org/jira/browse/JCR-390
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: indexing
>    Affects Versions: 1.0, 1.0.1
>         Environment: all
>            Reporter: Marcel Reutegger
>         Assigned To: Marcel Reutegger
>            Priority: Minor
>             Fix For: 1.3
