Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/07/16 16:54:15 UTC

[jira] Created: (JCR-2219) Improved background text extraction

Improved background text extraction
-----------------------------------

                 Key: JCR-2219
                 URL: https://issues.apache.org/jira/browse/JCR-2219
             Project: Jackrabbit Content Repository
          Issue Type: Improvement
          Components: indexing, jackrabbit-core
            Reporter: Jukka Zitting
            Priority: Minor


As recently discussed on the mailing list (see http://markmail.org/message/syt7lc2guzapt7la), the current approach to text extraction in background threads doesn't work well, especially with the Tika-based extractors that support streamed parsing of many document types.

Also, *all* of the extracted text streams are currently buffered into Strings before being passed into the Lucene index. It would be good if we could somehow get back to passing just Readers to Lucene.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (JCR-2219) Improved background text extraction

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739447#action_12739447 ] 

Marcel Reutegger commented on JCR-2219:
---------------------------------------

Fixed TextExtractorJob in revision: 801169 (it was using the transient keyword where volatile was actually intended).
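
For context on why the keyword matters: in Java, transient only excludes a field from serialization, whereas volatile guarantees that a write made by one thread is visible to later reads by other threads. The sketch below illustrates the volatile case only. StoppableJob, stopped, and stopsPromptly() are hypothetical names for illustration and are not taken from TextExtractorJob.

```java
// Hypothetical stand-in for a background job; not the actual TextExtractorJob code.
final class StoppableJob implements Runnable {

    // volatile: a write by one thread is visible to later reads by other threads.
    // transient would only exclude the field from serialization and give no
    // visibility guarantee, so the loop in run() might never see the update.
    private volatile boolean stopped;

    void stop() {
        stopped = true;
    }

    @Override
    public void run() {
        // Keep yielding until another thread sets the flag.
        while (!stopped) {
            Thread.yield();
        }
    }

    /** Starts a worker, stops it from the calling thread, and reports whether it exited. */
    static boolean stopsPromptly() {
        StoppableJob job = new StoppableJob();
        Thread worker = new Thread(job);
        worker.setDaemon(true);
        worker.start();
        job.stop();
        try {
            worker.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return !worker.isAlive();
    }
}
```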



[jira] Commented: (JCR-2219) Improved background text extraction

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739487#action_12739487 ] 

Marcel Reutegger commented on JCR-2219:
---------------------------------------

> Fixed occasional test failures in revision: 801135 

This change introduced other failures, so I reverted it in revision: 801226



[jira] Updated: (JCR-2219) Improved background text extraction

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-2219:
-------------------------------

    Status: Patch Available  (was: Open)



[jira] Updated: (JCR-2219) Improved background text extraction

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcel Reutegger updated JCR-2219:
----------------------------------

       Resolution: Fixed
    Fix Version/s: 2.0.0
           Status: Resolved  (was: Patch Available)

Looks very good.

Applied patch in revision: 799610



[jira] Commented: (JCR-2219) Improved background text extraction

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739435#action_12739435 ] 

Marcel Reutegger commented on JCR-2219:
---------------------------------------

Fixed occasional test failures in revision: 801135



[jira] Updated: (JCR-2219) Improved background text extraction

Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-2219:
-------------------------------

    Attachment: JCR-2219.patch

Attached a patch that starts the background text extraction thread as early as possible and counts the extraction timeout not only against the creation of a Reader but also against reading the extracted text from the Reader.
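
The idea of one timeout covering both steps can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in the patch: TimedExtraction and extract() are hypothetical names, and the sketch starts a fresh executor per call for brevity. The background task both opens the Reader and drains it, so a single deadline applies to the whole operation, and (as in the patch) the entire text is buffered in memory.

```java
import java.io.Reader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper; the names are illustrative and not part of Jackrabbit.
final class TimedExtraction {

    /** Returns the extracted text, or "" if extraction fails or exceeds timeoutMillis. */
    static String extract(Callable<Reader> source, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            // Start the background work immediately; the task covers both
            // creating the Reader and reading all text from it.
            Future<String> job = executor.submit(() -> {
                try (Reader reader = source.call()) {
                    StringBuilder text = new StringBuilder();
                    char[] buffer = new char[1024];
                    int n;
                    while ((n = reader.read(buffer)) != -1) {
                        text.append(buffer, 0, n);
                    }
                    return text.toString();
                }
            });
            try {
                // A single timeout bounds Reader creation plus reading.
                return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                job.cancel(true);
                return "";
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                job.cancel(true);
                return "";
            } catch (ExecutionException e) {
                return "";
            }
        } finally {
            executor.shutdownNow();
        }
    }
}
```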

Note that the patch buffers the *entire* extracted text into memory before passing it on to indexing. We currently buffer the text to a String in any case, so this isn't much of a regression (though now we have two copies of the string), but obviously it would be better if we could avoid that.

Some of the test cases had implicit assumptions about indexing speed that were broken by these changes. Based on some previous code snippets I added a new SearchIndex.flush() method that makes sure that all pending index changes have been processed and flushed to disk. This method is now automatically called by the executeSQLQuery() and executeXPATHQuery() methods in AbstractQueryTest to avoid any issues with late index updates. Later on we might find some uses for the new flush() method also outside the test suite.
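
The flush-before-query pattern can be illustrated with a small stand-alone sketch. It does not show the actual SearchIndex.flush() implementation: LazyIndex and all of its members are hypothetical, and the sleep merely simulates slow background extraction. Updates are applied asynchronously, and flush() blocks until every pending update has completed, so a lookup made right after flush() sees them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical index with asynchronous updates; not the Jackrabbit SearchIndex.
final class LazyIndex {

    // Daemon thread so that an unflushed index never keeps the JVM alive.
    private final ExecutorService updater = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "index-updater");
        t.setDaemon(true);
        return t;
    });
    private final List<Future<?>> pending = new ArrayList<>();
    private final Set<String> terms = ConcurrentHashMap.newKeySet();

    /** Queues an update that becomes visible only after the background work is done. */
    synchronized void add(String term) {
        pending.add(updater.submit(() -> {
            try {
                Thread.sleep(50); // simulate slow text extraction
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            terms.add(term);
        }));
    }

    /** Blocks until all updates queued so far have been applied. */
    synchronized void flush() {
        for (Future<?> update : pending) {
            try {
                update.get();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while flushing", e);
            } catch (ExecutionException e) {
                throw new IllegalStateException("index update failed", e.getCause());
            }
        }
        pending.clear();
    }

    boolean contains(String term) {
        return terms.contains(term);
    }
}
```

A test helper would call flush() immediately before running a query, in the same way the executeSQLQuery() and executeXPATHQuery() methods are described as doing above.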

Things to do:

* The patch still mostly follows the existing code structure to make it easier to review the changes. We could probably simplify the code and avoid the extra String copy of the extracted text by merging the TextExtractorReader and TextExtractorJob classes.

* Going further, we could probably drop the PooledTextExtractor class in favor of a simpler thread pool that the NodeIndexer would use to execute TextExtractorJobs.
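
As a rough illustration of the second item, a plain thread pool from java.util.concurrent could run the extraction jobs directly. The sketch below is one possible shape under stated assumptions, not a proposal for the actual NodeIndexer code: SharedExtractionPool and extractAll() are hypothetical names, and each job is modelled as a Callable that returns the extracted text. It relies on ExecutorService.invokeAll with a timeout, which cancels any job that has not finished when the timeout expires.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the names are illustrative and not part of Jackrabbit.
final class SharedExtractionPool {

    /** Runs all jobs on a fixed pool; failed or timed-out jobs yield "". */
    static List<String> extractAll(List<Callable<String>> jobs, int threads, long timeoutMillis) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // invokeAll waits until all jobs are done or the timeout expires,
            // and cancels the jobs that are still running at that point.
            List<Future<String>> futures =
                    pool.invokeAll(jobs, timeoutMillis, TimeUnit.MILLISECONDS);
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                if (future.isCancelled()) {
                    results.add(""); // timed out
                    continue;
                }
                try {
                    results.add(future.get());
                } catch (ExecutionException e) {
                    results.add(""); // extraction failed
                }
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return new ArrayList<>();
        } finally {
            pool.shutdownNow();
        }
    }
}
```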




[jira] Commented: (JCR-2219) Improved background text extraction

Posted by "Marcel Reutegger (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12739690#action_12739690 ] 

Marcel Reutegger commented on JCR-2219:
---------------------------------------

Re-applied some of the 801135 changes to make test execution more reliable.

svn revision: 801375
