Posted to dev@tika.apache.org by "Björn Kautler (Jira)" <ji...@apache.org> on 2022/07/27 18:15:00 UTC

[jira] [Commented] (TIKA-1484) Boilerpipe dependency is evil

    [ https://issues.apache.org/jira/browse/TIKA-1484?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17572077#comment-17572077 ] 

Björn Kautler commented on TIKA-1484:
-------------------------------------

I analysed the code a bit.
As far as I can see, the situation is as follows:
 - Boilerpipe is only used in {{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-commons/src/main/java/org/apache/tika/sax/boilerpipe/BoilerpipeContentHandler.java}}
 - {{BoilerpipeContentHandler}} is only used in
 -- {{tika-app/src/main/java/org/apache/tika/cli/TikaCLI.java}}
 -- {{tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java}}
 -- {{tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-html-module/src/test/java/org/apache/tika/parser/html/HtmlParserTest.java}}
 -- {{tika-server/tika-server-core/src/main/java/org/apache/tika/server/core/resource/TikaResource.java}}
 - {{tika-parser-html-commons}} consists solely of {{BoilerpipeContentHandler.java}}

So my suggestion would be to
 * move the test that tests {{BoilerpipeContentHandler}} to {{tika-parser-html-commons}}
 * remove the dependency from {{tika-parser-html-module}} to {{tika-parser-html-commons}}
 * add a dependency from {{tika-app}} to {{tika-parser-html-commons}} 
 * maybe even rename {{tika-parser-html-commons}} to {{tika-parser-boilerplate}} to also reduce the risk of it being used again in the future by the html module

This way, {{tika-app}} and {{tika-server}} (which already has an explicit dependency on {{tika-parser-html-commons}}) continue to work as before, but for users using Tika as a library the Boilerpipe dependency vanishes.

It will be a breaking change in the unlikely case that someone actually uses {{BoilerpipeContentHandler}} explicitly, but it will be an improvement for almost everyone else.
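
For anyone in that situation, opting back in would presumably just mean declaring the module explicitly. A rough sketch of such a Maven declaration follows. It assumes the artifact keeps its current name ({{tika-parser-html-commons}}) rather than being renamed, and {{tika.version}} is a hypothetical property standing for whatever Tika version the project already uses:

```xml
<!-- Sketch only: explicit opt-in to the Boilerpipe handler after the proposed change. -->
<dependency>
  <groupId>org.apache.tika</groupId>
  <!-- Assumption: the module is not renamed; adjust the artifactId if it is. -->
  <artifactId>tika-parser-html-commons</artifactId>
  <!-- Hypothetical property; use the same version as your other Tika artifacts. -->
  <version>${tika.version}</version>
</dependency>
```

The same kind of declaration is what {{tika-app}} would need to gain under the suggestion above.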

> Boilerpipe dependency is evil
> -----------------------------
>
>                 Key: TIKA-1484
>                 URL: https://issues.apache.org/jira/browse/TIKA-1484
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Ben McCann
>            Priority: Major
>
> The Boilerpipe project bundles inside it two classes from org.cyberneko.html. We're already using NekoHTML in our project. Depending on which library shows up on our classpath certain parts of our project will either work or not. I'd really love it if Boilerpipe could be fixed or replaced with some other library that is a better citizen.
> I see I'm not the first person to run into this as another Tika user has filed a bug on the Boilerpipe project: https://code.google.com/p/boilerpipe/issues/detail?id=62


