Posted to dev@tika.apache.org by "Ken Krugler (JIRA)" <ji...@apache.org> on 2018/01/05 14:19:00 UTC

[jira] [Resolved] (TIKA-2539) TagSoup HTML parser is project EOL

     [ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ken Krugler resolved TIKA-2539.
-------------------------------
    Resolution: Duplicate

> TagSoup HTML parser is project EOL
> ----------------------------------
>
>                 Key: TIKA-2539
>                 URL: https://issues.apache.org/jira/browse/TIKA-2539
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.16, 1.17
>         Environment: All
>            Reporter: Richard Jones
>
> The TagSoup HTML parser project is EOL; its last update was the 1.2.1 release (the version Tika references) back in Aug 2011.
> I cannot find any TagSoup forks that are still active, but there are many alternative (and perhaps better, if you believe the reviews and Wikipedia comparisons) HTML parsers out there.
> Perhaps the most active is jsoup, which Tika already pulls in as a transitive dependency of edu.ucar:grib; it has over 1,000 usages and updates as recent as a few months ago:
> https://mvnrepository.com/artifact/org.jsoup/jsoup
> https://jsoup.org/
> Requesting consideration of moving away from the long-EOL'd TagSoup to an active, modern HTML parser such as jsoup, which is already a transitive Tika dependency.


