Posted to dev@tika.apache.org by "Maxim Valyanskiy (JIRA)" <ji...@apache.org> on 2009/06/09 13:04:07 UTC
[jira] Created: (TIKA-244) Missing Header/Footer text for Word'97 documents
Missing Header/Footer text for Word'97 documents
------------------------------------------------
Key: TIKA-244
URL: https://issues.apache.org/jira/browse/TIKA-244
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 0.3
Reporter: Maxim Valyanskiy
Attachments: tika-patch
Tika output lacks header/footer text for Word'97 documents. This patch fixes the problem:
diff -u -r apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java
--- apache-tika-0.3/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-02-14 03:07:51.000000000 +0300
+++ apache-tika-0.3-modified/src/main/java/org/apache/tika/parser/microsoft/OfficeParser.java 2009-06-09 13:24:56.000000000 +0400
@@ -75,9 +75,14 @@
             } else if ("WordDocument".equals(name)) {
                 setType(metadata, "application/msword");
                 WordExtractor extractor = new WordExtractor(filesystem);
+
+                xhtml.element("p", extractor.getHeaderText());
+
                 for (String paragraph : extractor.getParagraphText()) {
                     xhtml.element("p", paragraph);
                 }
+
+                xhtml.element("p", extractor.getFooterText());
             } else if ("PowerPoint Document".equals(name)) {
                 setType(metadata, "application/vnd.ms-powerpoint");
                 PowerPointExtractor extractor =
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Resolved: (TIKA-244) Missing Header/Footer text for Word'97 documents
Posted by "Jukka Zitting (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jukka Zitting resolved TIKA-244.
--------------------------------
Resolution: Fixed
Fix Version/s: 0.4
Assignee: Jukka Zitting
Thanks! Patch applied in revision 788595.
I added <div class="header"/> and <div class="footer"/> wrappers around the header and footer texts, and changed the code to output each section only when its text is non-empty.
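
For readers without access to revision 788595, the behaviour described above can be sketched roughly as follows. This is not the committed Tika code: the class HeaderFooterSketch and its methods appendSection, render and escape are hypothetical names, and a StringBuilder stands in for Tika's XHTML handler so the example runs without Tika or POI on the classpath. It assumes that "non-empty" means non-null and not whitespace-only.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the header/footer handling described in the comment above.
// A StringBuilder replaces Tika's XHTML handler so the sketch has no dependencies.
class HeaderFooterSketch {

    // Minimal escaping so the generated markup stays well-formed.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Emits <div class="..."><p>text</p></div>, but only when the text is non-empty
    // (assumption: null and whitespace-only text both count as empty).
    static void appendSection(StringBuilder out, String cssClass, String text) {
        if (text == null || text.trim().isEmpty()) {
            return; // nothing to output, so no empty wrapper is written
        }
        out.append("<div class=\"").append(cssClass).append("\"><p>")
           .append(escape(text.trim()))
           .append("</p></div>");
    }

    // Header first, then body paragraphs, then footer: the order used in the patch.
    static String render(String header, List<String> paragraphs, String footer) {
        StringBuilder out = new StringBuilder();
        appendSection(out, "header", header);
        for (String paragraph : paragraphs) {
            out.append("<p>").append(escape(paragraph)).append("</p>");
        }
        appendSection(out, "footer", footer);
        return out.toString();
    }

    public static void main(String[] args) {
        // A document with a header but no footer: only the header wrapper appears.
        System.out.println(render("Confidential", Arrays.asList("Body text"), ""));
    }
}
```

In the real parser the three strings would come from extractor.getHeaderText(), extractor.getParagraphText() and extractor.getFooterText(), as in the original patch.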
[jira] Updated: (TIKA-244) Missing Header/Footer text for Word'97 documents
Posted by "Maxim Valyanskiy (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/TIKA-244?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Maxim Valyanskiy updated TIKA-244:
----------------------------------
Attachment: tika-patch