Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2018/01/05 14:19:00 UTC
[jira] [Resolved] (TIKA-2539) TagSoup HTML parser is project EOL
[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler resolved TIKA-2539.
-------------------------------
Resolution: Duplicate
> TagSoup HTML parser is project EOL
> ----------------------------------
>
> Key: TIKA-2539
> URL: https://issues.apache.org/jira/browse/TIKA-2539
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.16, 1.17
> Environment: All
> Reporter: Richard Jones
>
> The TagSoup HTML parser project has reached end of life (EOL); its last update was the 1.2.1 release (the version Tika references), back in Aug 2011.
> I cannot find any TagSoup forks that are still active, but there are many alternative HTML parsers out there (perhaps better ones, if you believe the reviews and Wikipedia comparisons).
> Perhaps the most active is jsoup, which Tika already pulls in as a transitive dependency of edu.ucar:grib. It has over 1,000 usages and updates as recent as a few months ago:
> https://mvnrepository.com/artifact/org.jsoup/jsoup
> https://jsoup.org/
> Requesting consideration of moving away from the long-EOL'd TagSoup to an active, modern HTML parser such as jsoup, which is already a transitive Tika dependency.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)