Posted to dev@jackrabbit.apache.org by "Alex Parvulescu (Created) (JIRA)" <ji...@apache.org> on 2011/11/14 14:36:51 UTC

[jira] [Created] (JCR-3146) Text extraction may congest thread pool in the repository

Text extraction may congest thread pool in the repository
---------------------------------------------------------

                 Key: JCR-3146
                 URL: https://issues.apache.org/jira/browse/JCR-3146
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: jackrabbit-core
            Reporter: Alex Parvulescu
            Priority: Minor


Text extraction congests the repository's thread pool when, for example, many PDFs are loaded into the workspace. Tasks submitted by the index merger are delayed as a result, which leads to many index segment folders.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Resolved] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Alex Parvulescu (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu resolved JCR-3146.
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.4
         Assignee: Alex Parvulescu

fixed in rev:1202192
                


        

[jira] [Updated] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Jukka Zitting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-3146:
-------------------------------

    Fix Version/s: 2.2.12

Merged to the 2.2 branch in revision 1242468.
                


        

[jira] [Updated] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Alex Parvulescu (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alex Parvulescu updated JCR-3146:
---------------------------------

    Attachment: JCR-3146.patch

The solution is to define a second queue for tasks considered low priority, so that they don't fill the execution queue.
Then, depending on the executor's load, this queue is polled for additional work items.

The secondary queue is only used as needed, and the load threshold is configurable via the system property
"org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks"
The property is interpreted as a percentage: 0 means disabled, and the default is 75.
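
As a rough illustration of the mechanism described above (not the contents of JCR-3146.patch), the admission decision could be sketched as follows. The class name LowPriorityQueueSketch and the methods admit() and pollIfLoadAllows() are hypothetical, the load is passed in as an integer percentage rather than measured from a real executor, and reading 0 as "the secondary queue is switched off" is an assumption about what "disabled" means.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch; NOT the JackrabbitThreadPool code from JCR-3146.patch. */
final class LowPriorityQueueSketch {

    /** Property name as quoted in the comment above. */
    static final String MAX_LOAD_PROPERTY =
            "org.apache.jackrabbit.core.JackrabbitThreadPool.maxLoadForLowPriorityTasks";

    /** Secondary queue that holds deferred low-priority tasks. */
    private final Queue<Runnable> lowPriorityQueue = new ArrayDeque<>();
    private final int maxLoadPercent;

    LowPriorityQueueSketch(int maxLoadPercent) {
        this.maxLoadPercent = maxLoadPercent;
    }

    /** Reads the threshold from the system property, falling back to the default of 75. */
    static LowPriorityQueueSketch fromSystemProperty() {
        return new LowPriorityQueueSketch(Integer.getInteger(MAX_LOAD_PROPERTY, 75));
    }

    /** Assumption: 0 switches the secondary queue off, so nothing is ever deferred. */
    private boolean overloaded(int loadPercent) {
        return maxLoadPercent > 0 && loadPercent > maxLoadPercent;
    }

    /**
     * Returns true if the caller should hand the task to the executor now,
     * false if the task was parked in the secondary queue.
     */
    synchronized boolean admit(Runnable task, boolean lowPriority, int loadPercent) {
        if (lowPriority && overloaded(loadPercent)) {
            lowPriorityQueue.add(task);
            return false;
        }
        return true;
    }

    /** Returns the next parked task if the load permits, otherwise null. */
    synchronized Runnable pollIfLoadAllows(int loadPercent) {
        return overloaded(loadPercent) ? null : lowPriorityQueue.poll();
    }
}
```

In this sketch, normal tasks are always admitted, so index merger work would never wait behind parked text extraction jobs. How the load is measured and when the secondary queue is polled are left open here.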

There are some timing issues with the indexing tests on account of this new async text extraction. I've tried to fix all of them, but there may be more.

I haven't yet touched the Tika extraction that happens in a separate process. I think that will need some minor refactoring as well.

Attaching proposed patch.


                


        

[jira] [Updated] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Jukka Zitting (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-3146:
-------------------------------

    Fix Version/s:     (was: 2.4)
                   2.3.4
    


        

[jira] [Commented] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Marcel Reutegger (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150424#comment-13150424 ] 

Marcel Reutegger commented on JCR-3146:
---------------------------------------

In general looks good to me.

I'm not sure the communication between the threads in JackrabbitThreadPool
is 100% correct. E.g. the first statement in RetryLowPriorityTask.run()
checks if the queue is empty. To me it seems like this should never
happen, right?

Style:
Should we rather keep the JackrabbitThreadPool package private and
only expose the marker as public interface? How about renaming
the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from
Runnable? That way a client wouldn't have to implement Runnable
and the marker interface.
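
A minimal sketch of the marker-interface suggestion above, assuming the name LowPriorityTask proposed in this comment. The enclosing class LowPriorityTaskSketch and the helper isLowPriority() are hypothetical and exist only to make the example self-contained.

```java
/** Hypothetical holder class; only the nested interface reflects the suggestion. */
final class LowPriorityTaskSketch {

    /**
     * Marker interface that extends Runnable, so a client implements a
     * single type instead of Runnable plus a separate marker.
     */
    public interface LowPriorityTask extends Runnable {
    }

    /** How a pool could tell low-priority work apart from normal work. */
    static boolean isLowPriority(Runnable task) {
        return task instanceof LowPriorityTask;
    }

    private LowPriorityTaskSketch() {
    }
}
```

Because the interface adds no methods of its own, a client can still write a task as a lambda, for example `LowPriorityTaskSketch.LowPriorityTask t = () -> extractText();` where extractText() is a placeholder.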

Minor:
The method waitForTextExtractionTasksToFinish() already does an
index flush at the end. Aren't the additional index flush calls
in IndexingQueueTest now obsolete?

                


        

[jira] [Commented] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Marcel Reutegger (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150458#comment-13150458 ] 

Marcel Reutegger commented on JCR-3146:
---------------------------------------

> you mean like moving the interface to the util package?

No, I would still keep it in the same package as JackrabbitThreadPool.
                


        

[jira] [Commented] (JCR-3146) Text extraction may congest thread pool in the repository

Posted by "Alex Parvulescu (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-3146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13150450#comment-13150450 ] 

Alex Parvulescu commented on JCR-3146:
--------------------------------------

Thanks for taking the time to review the patch.

> To me it seems like this should never happen, right? 
Yes, that check is just premature optimization. I'll remove it.

> Should we rather keep the JackrabbitThreadPool package private and only expose the marker as public interface?
You mean like moving the interface to the util package?

> How about renaming the LOW_PRIORITY_MARKER to LowPriorityTask and extend it from Runnable?
That is a good idea.

> Aren't the additional index flush calls in IndexingQueueTest now obsolete?
True. I had some timing issues, which are now hopefully fixed, so yes, we can remove the extra flush.


                
