Posted to notifications@james.apache.org by GitBox <gi...@apache.org> on 2022/03/03 04:18:48 UTC

[GitHub] [james-project] chibenwa opened a new pull request #902: JAMES-3719 Reactive textual content extraction with Apache Tika

chibenwa opened a new pull request #902:
URL: https://github.com/apache/james-project/pull/902


   Tika was called from reactive code and performed blocking HTTP calls from
   within the MIME parsing code.
   
   This caused:
    - Unneeded thread consumption, as some threads sat waiting for the Tika
      response.
    - Potentially dangerous blocking calls: for instance, the InVM event bus
      made such calls on the parallel thread pool (where it is critical NOT to
      block).
    - No connection reuse: a connection was opened for every call.
   
   We introduce the following changes:
    - Reactification of the TextExtractor API.
    - We re-implement the HTTP calls made by TikaTextExtractor with
      reactor-netty, which lets us pool HTTP connections and perform the calls
      in a non-blocking, reactive fashion.
    - We provide a reactive cache using the Caffeine caching library. Guava
      caches are blocking and thus not an option.
    - We decouple text extraction from the MIME parsing phase by introducing
      an intermediate POJO. Doing so requires a post-parsing copy of the
      content. We only do the copy if necessary: we don't want to copy large
      attachments for which no text is going to be extracted.
    - Finally, we reactify index content generation in the ElasticSearch code.
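
As a rough illustration of the API and cache changes listed above (not the actual James code), the sketch below shows an extractor that returns a future instead of blocking, wrapped by a cache that stores the future itself so concurrent callers share one in-flight extraction. To stay JDK-only it uses `CompletableFuture` as a stand-in for Reactor's `Mono` and a `ConcurrentHashMap` as a stand-in for Caffeine. `AsyncTextExtractor`, `CachingExtractor`, and the cache key are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ReactiveExtractionSketch {

    /** Hypothetical non-blocking extractor: returns at once, completes later. */
    interface AsyncTextExtractor {
        CompletableFuture<Optional<String>> extract(String contentKey, byte[] content);
    }

    /**
     * Caches the future rather than the value, so a second request for the
     * same key reuses the pending (or completed) extraction.
     * Assumption: no eviction, size bound, or failure handling is shown;
     * a real caching library would provide those.
     */
    static final class CachingExtractor implements AsyncTextExtractor {
        private final AsyncTextExtractor delegate;
        private final ConcurrentMap<String, CompletableFuture<Optional<String>>> cache =
            new ConcurrentHashMap<>();

        CachingExtractor(AsyncTextExtractor delegate) {
            this.delegate = delegate;
        }

        @Override
        public CompletableFuture<Optional<String>> extract(String contentKey, byte[] content) {
            // The delegate only schedules work and returns a future, so the
            // mapping function stays short and never waits on the backend.
            return cache.computeIfAbsent(contentKey, key -> delegate.extract(key, content));
        }
    }

    public static void main(String[] args) {
        AtomicInteger backendCalls = new AtomicInteger();

        // Fake backend standing in for the remote extraction service.
        AsyncTextExtractor fakeBackend = (key, content) -> {
            backendCalls.incrementAndGet();
            return CompletableFuture.supplyAsync(
                () -> Optional.of(new String(content, StandardCharsets.UTF_8).trim()));
        };

        AsyncTextExtractor extractor = new CachingExtractor(fakeBackend);
        byte[] attachment = "  hello  ".getBytes(StandardCharsets.UTF_8);

        CompletableFuture<Optional<String>> first = extractor.extract("attachment-hash", attachment);
        CompletableFuture<Optional<String>> second = extractor.extract("attachment-hash", attachment);

        // join() is used only at the edge of this demo; reactive callers
        // would chain thenApply/thenCompose instead of waiting.
        System.out.println(first.join().orElse("") + " / " + second.join().orElse(""));
        System.out.println("backend calls: " + backendCalls.get());
    }
}
```

Storing the future rather than the extracted text is what keeps callers from blocking: a cache miss returns immediately with a pending result, and a repeated request for the same content attaches to it instead of triggering a second call.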


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@james.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org
