Posted to dev@nutch.apache.org by "Sebastian Nagel (Jira)" <ji...@apache.org> on 2021/09/17 11:23:00 UTC

[jira] [Created] (NUTCH-2891) Upgrade to Tika 2.1

Sebastian Nagel created NUTCH-2891:
--------------------------------------

             Summary: Upgrade to Tika 2.1
                 Key: NUTCH-2891
                 URL: https://issues.apache.org/jira/browse/NUTCH-2891
             Project: Nutch
          Issue Type: Improvement
          Components: parser, plugin
    Affects Versions: 1.18
            Reporter: Sebastian Nagel
            Assignee: Sebastian Nagel
             Fix For: 1.19


Tika 2 already has its second release ([2.1.0|https://tika.apache.org/2.1.0/index.html]). Points to consider, based on the [2.0 release notes|https://archive.apache.org/dist/tika/2.0.0/CHANGES-2.0.0.txt] and the [migration guide|https://cwiki.apache.org/confluence/display/TIKA/Migrating+to+Tika+2.0.0]:
* Tika 2 is more modular, which should allow us to build a smaller parse-tika plugin (66 MiB in the 1.18 binary package) by dropping rarely used parsers. Users should still be able to include them if they build Nutch from source.
* The language-identifier plugin needs to be upgraded as well (in addition to Nutch core and the parse-tika plugin). This would include or overlap with NUTCH-2449.
* To keep the PDF parser from timing out, we probably want to disable OCR by default or, at least, provide a configuration snippet for this purpose.
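
A minimal sketch of what such a configuration snippet could look like, assuming the parser-exclude mechanism of tika-config.xml and assuming OCR is provided by org.apache.tika.parser.ocr.TesseractOCRParser (the class name used by Tika, not taken from this issue). How parse-tika would load this file, and whether this is the approach Nutch finally adopts, is left open here:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of a tika-config.xml that keeps all default parsers but
     excludes the Tesseract OCR parser, so that PDFs and images are
     never sent through OCR. -->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <!-- Assumption: OCR is provided by this parser class -->
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>
```

Excluding the OCR parser as a whole is the coarsest option. A PDF-specific setting that only turns off OCR inside the PDF parser may be preferable if OCR should remain available for image files.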

--
This message was sent by Atlassian Jira
(v8.3.4#803005)