Posted to dev@tika.apache.org by "chris hudson (JIRA)" <ji...@apache.org> on 2011/04/20 16:21:05 UTC
[jira] [Created] (TIKA-644) parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
-----------------------------------------------------------------------------------------------------------
Key: TIKA-644
URL: https://issues.apache.org/jira/browse/TIKA-644
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.9
Reporter: chris hudson
Priority: Minor
org.apache.tika.parser.microsoft.WordExtractor translates heading styles with a level greater than 6 to "h" tags such as <h7>, which makes the XHTML invalid. The XHTML DTD only defines heading elements 1 to 6:
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
changing line 380 from:
tag = "h"+num;
to
tag = "h"+Math.max(num, 6);
will resolve this.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (TIKA-644) parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
Posted by "chris hudson (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
chris hudson updated TIKA-644:
------------------------------
Description:
org.apache.tika.parser.microsoft.WordExtractor translates heading styles with a level greater than 6 to "h" tags such as <h7>, which makes the XHTML invalid. The XHTML DTD only defines heading elements 1 to 6:
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
changing line 380 from:
tag = "h"+num;
to
tag = "h"+Math.min(num, 6);
will resolve this.
was:
org.apache.tika.parser.microsoft.WordExtractor translates heading styles with a level greater than 6 to "h" tags such as <h7>, which makes the XHTML invalid. The XHTML DTD only defines heading elements 1 to 6:
<!ENTITY % heading "h1|h2|h3|h4|h5|h6">
changing line 380 from:
tag = "h"+num;
to
tag = "h"+Math.max(num, 6);
will resolve this.
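To illustrate why the description changed from Math.max to Math.min, here is a minimal sketch. The class and method names (HeadingTagSketch, headingTag, headingTagWithMax) are hypothetical and are not taken from WordExtractor; only the two expressions come from the report above.

```java
public class HeadingTagSketch {
    // Hypothetical helper, not the actual WordExtractor code.
    // Caps the level at 6, the highest heading the XHTML DTD defines.
    static String headingTag(int num) {
        return "h" + Math.min(num, 6);
    }

    // The expression from the earlier description, kept for comparison.
    // Math.max never returns less than 6, so levels above 6 pass through
    // unchanged and levels below 6 are raised to 6.
    static String headingTagWithMax(int num) {
        return "h" + Math.max(num, 6);
    }

    public static void main(String[] args) {
        for (int num : new int[] {1, 6, 7, 9}) {
            System.out.println("Heading " + num + ": min gives " + headingTag(num)
                    + ", max gives " + headingTagWithMax(num));
        }
    }
}
```

With Math.min, "Heading 7" and above all map to h6 while levels 1 to 6 keep their own tag. With Math.max, "Heading 7" would still produce h7 and "Heading 1" would become h6.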
> parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
> -----------------------------------------------------------------------------------------------------------
>
> Key: TIKA-644
> URL: https://issues.apache.org/jira/browse/TIKA-644
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.9
> Reporter: chris hudson
> Priority: Minor
> Labels: doc, h, parser, tag
> Original Estimate: 5m
> Remaining Estimate: 5m
>
> org.apache.tika.parser.microsoft.WordExtractor translates heading styles with a level greater than 6 to "h" tags such as <h7>, which makes the XHTML invalid. The XHTML DTD only defines heading elements 1 to 6:
> <!ENTITY % heading "h1|h2|h3|h4|h5|h6">
> changing line 380 from:
> tag = "h"+num;
> to
> tag = "h"+Math.min(num, 6);
> will resolve this.
[jira] [Resolved] (TIKA-644) parsing of Microsoft Word doc with style "Heading X" where X>6 creates invalid HTML with tags <h7>,<h8> etc
Posted by "Nick Burch (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Nick Burch resolved TIKA-644.
-----------------------------
Resolution: Fixed
Fix Version/s: 1.0
Assignee: Nick Burch
Good spot! Fixed in r1095429.