Posted to dev@tika.apache.org by "David Morana (JIRA)" <ji...@apache.org> on 2013/04/05 16:20:13 UTC

[jira] [Created] (TIKA-1102) Can we add <div> to the list of heuristics for bad html fragments?

David Morana created TIKA-1102:
----------------------------------

             Summary: Can we add <div> to the list of heuristics for bad html fragments?
                 Key: TIKA-1102
                 URL: https://issues.apache.org/jira/browse/TIKA-1102
             Project: Tika
          Issue Type: Improvement
    Affects Versions: 1.3, 1.2
         Environment: I'm using Solr 4.0 final with Tika v1.2 and ManifoldCF v1.2dev, all on Tomcat 7.0.37
            Reporter: David Morana


Good morning,
Crawling legacy sites that contain poorly written HTML fragments causes severe XML parse errors in Solr and in turn causes ManifoldCF to abort.
Can we add <div> to the list of heuristics so that the HTML parser is used instead of the XML parser?
See this ticket for further information: [TIKA-1101|https://issues.apache.org/jira/browse/TIKA-1101]
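
To illustrate the request, here is a minimal sketch of the kind of check I have in mind. It is not Tika's actual detection code: the class name, the method name and the tag list are all hypothetical, and it only assumes that a fragment opening with a known HTML tag should be routed to the HTML parser.

```java
import java.util.Locale;

public class HtmlFragmentHeuristic {
    // Hypothetical tag list for illustration only; not Tika's real patterns.
    // "<div" is the entry this issue asks for.
    private static final String[] HTML_HINTS = {
        "<html", "<head", "<body", "<title", "<div"
    };

    /**
     * Returns true when the first non-whitespace text is one of the hinted
     * opening tags, i.e. the content should be treated as HTML, not XML.
     */
    public static boolean looksLikeHtml(String content) {
        String head = content.trim().toLowerCase(Locale.ROOT);
        for (String hint : HTML_HINTS) {
            if (head.startsWith(hint)) {
                // Require a tag boundary so "<div" does not match "<divider".
                if (head.length() == hint.length()) {
                    return true;
                }
                char next = head.charAt(hint.length());
                if (next == '>' || next == '/' || Character.isWhitespace(next)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // A bare fragment with no <html> wrapper and an unclosed <p>.
        System.out.println(looksLikeHtml("<div><p>legacy fragment"));
        // A genuine XML document should still go to the XML parser.
        System.out.println(looksLikeHtml("<?xml version=\"1.0\"?><doc/>"));
    }
}
```

With a check like this, a fragment that starts with <div> would be handed to the lenient HTML parser, while real XML documents would still be handled by the XML parser.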

Thank you,

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira