Posted to dev@jackrabbit.apache.org by "Jukka Zitting (JIRA)" <ji...@apache.org> on 2009/07/16 17:12:14 UTC

[jira] Updated: (JCR-2219) Improved background text extraction

     [ https://issues.apache.org/jira/browse/JCR-2219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jukka Zitting updated JCR-2219:
-------------------------------

    Attachment: JCR-2219.patch

Attached a patch that starts the background text extraction thread as early as possible and counts the extraction timeout not only against the creation of a Reader but also against reading the extracted text from the Reader.

Note that the patch buffers the *entire* extracted text into memory before passing it on to indexing. We currently buffer the text into a String anyway, so this isn't much of a regression (though we now hold two copies of the string), but it would obviously be better to avoid that.
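
As a rough illustration of the timeout change described above (this is not the code in JCR-2219.patch), the sketch below runs both the creation of the Reader and the reading of its content inside one background job, so a single timeout covers both steps. The class TimedExtraction and its methods are hypothetical names, and returning null on a timeout is an assumption made only for this example.

```java
import java.io.Reader;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not a Jackrabbit class.
final class TimedExtraction {

    /** Opens the Reader and drains it into memory; both steps run inside the job. */
    static String readAll(Callable<Reader> source) throws Exception {
        StringBuilder text = new StringBuilder();
        try (Reader reader = source.call()) {
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    /**
     * Returns the extracted text, or null if opening plus reading did not
     * finish within the timeout (assumption: the caller decides what to do then).
     */
    static String extract(ExecutorService executor, Callable<Reader> source,
                          long timeoutMillis) {
        Callable<String> task = () -> readAll(source);
        Future<String> job = executor.submit(task);
        try {
            return job.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The job is not cancelled here; it keeps running in the background.
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        } catch (ExecutionException e) {
            // Extraction failed; treated like "no text" in this sketch.
            return null;
        }
    }
}
```

Because the whole text is collected into a StringBuilder and then copied into a String, the sketch shows the same kind of in-memory buffering that the note above describes.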

Some of the test cases had implicit assumptions about indexing speed that were broken by these changes. Based on some previous code snippets I added a new SearchIndex.flush() method that makes sure that all pending index changes have been processed and flushed to disk. This method is now automatically called by the executeSQLQuery() and executeXPATHQuery() methods in AbstractQueryTest to avoid any issues with late index updates. Later on we might also find uses for the new flush() method outside the test suite.
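
To illustrate why a flush step helps tests that query right after writing, here is a minimal sketch of the general idea. It does not model the real SearchIndex class or any disk I/O: PendingIndex and its methods are hypothetical, and the "index" is just an in-memory list updated by a single background thread.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical stand-in for an index with late (background) updates.
final class PendingIndex {
    // A single worker thread applies updates in submission order.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private final List<String> committed = new CopyOnWriteArrayList<>();

    /** Queues an update that only becomes visible after the given delay. */
    void addLater(String text, long delayMillis) {
        worker.submit(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            committed.add(text);
        });
    }

    /** Blocks until every update queued before this call has been applied. */
    void flush() {
        try {
            // An empty task acts as a barrier: it runs only after earlier tasks.
            worker.submit(() -> { }).get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while flushing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        }
    }

    boolean contains(String text) {
        return committed.contains(text);
    }

    void close() {
        worker.shutdownNow();
    }
}
```

A test that calls flush() before querying no longer depends on how quickly the background update happens to complete.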

Things to do:

* The patch still mostly follows the existing code structure to make it easier to review the changes. We could probably simplify the code and avoid the extra String copy of the extracted text by merging the TextExtractorReader and TextExtractorJob classes.

* Going further, we could probably drop the PooledTextExtractor class in favor of a simpler thread pool that the NodeIndexer would use to execute TextExtractorJobs.
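
The thread pool idea in the last item could look roughly like the following. This is only a sketch under assumptions: ExtractionPool is a hypothetical class, extraction jobs are modelled as plain Callable<String> instances rather than TextExtractorJobs, and jobs that miss the timeout are cancelled and reported as null.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

// Hypothetical pool that an indexer could hand extraction jobs to.
final class ExtractionPool {
    private final ExecutorService pool;

    ExtractionPool(int threads) {
        // Daemon threads so that pending extractions never block JVM shutdown.
        pool = Executors.newFixedThreadPool(threads, runnable -> {
            Thread t = new Thread(runnable, "text-extractor");
            t.setDaemon(true);
            return t;
        });
    }

    /**
     * Runs all jobs and returns one result per job, in order. A job that
     * failed or did not finish within the timeout yields null. If the calling
     * thread is interrupted, the returned list may be shorter than the input.
     */
    List<String> runAll(List<Callable<String>> jobs, long timeoutMillis) {
        List<String> results = new ArrayList<>();
        try {
            // invokeAll cancels every job still unfinished when the timeout expires.
            for (Future<String> f : pool.invokeAll(jobs, timeoutMillis, TimeUnit.MILLISECONDS)) {
                try {
                    results.add(f.get());
                } catch (CancellationException | ExecutionException e) {
                    results.add(null);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return results;
    }

    void shutdown() {
        pool.shutdownNow();
    }
}
```

Unlike the background extraction discussed in this issue, this sketch cancels slow jobs instead of letting them finish later, so it shows only the pooling part of the proposal.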


> Improved background text extraction
> -----------------------------------
>
>                 Key: JCR-2219
>                 URL: https://issues.apache.org/jira/browse/JCR-2219
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>          Components: indexing, jackrabbit-core
>            Reporter: Jukka Zitting
>            Priority: Minor
>         Attachments: JCR-2219.patch
>
>
> As recently discussed on the mailing list (see http://markmail.org/message/syt7lc2guzapt7la), the current approach to text extraction in background threads doesn't work that well, especially with the Tika-based extractors that support streamed parsing of many document types.
> Also, currently *all* of the extracted text streams are buffered into Strings before being passed into the Lucene index. It would be good if we could somehow get back to passing just Readers to Lucene.
