Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2016/02/04 20:57:39 UTC

[jira] [Commented] (TIKA-1836) Conversion DOC->TXT failed due to POI issue

    [ https://issues.apache.org/jira/browse/TIKA-1836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15132895#comment-15132895 ] 

Tim Allison commented on TIKA-1836:
-----------------------------------

Committed a workaround in POI r1728547 that logs the problem rather than throwing an exception.  Once the next version of POI is released and integrated into Tika, this issue should be "fixed" at the Tika level.  The true fix would be to add parsing for that kind of record in POI...any takers?
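
For anyone unfamiliar with the "log rather than throw" approach, here is a minimal sketch of the general pattern. It is not the actual change committed to POI: the class LenientTableReader and the methods readStrict and readLenient are hypothetical names invented for illustration, and the boolean flag only stands in for whatever check distinguishes the supported record format from the unsupported one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Logger;

// Hypothetical stand-in for a string-table reader; this is NOT POI's Sttb code.
public class LenientTableReader {
    private static final Logger LOG =
            Logger.getLogger(LenientTableReader.class.getName());

    /** Strict behaviour: an unsupported record format aborts the whole parse. */
    public static List<String> readStrict(boolean extended, List<String> rawEntries) {
        if (!extended) {
            throw new UnsupportedOperationException(
                    "Non-extended character Pascal strings are not supported right now.");
        }
        return new ArrayList<>(rawEntries);
    }

    /** Lenient behaviour: log the unsupported format and return an empty table. */
    public static List<String> readLenient(boolean extended, List<String> rawEntries) {
        if (!extended) {
            // The table is skipped, so the rest of the document can still be parsed.
            LOG.warning("Non-extended character Pascal strings are not supported;"
                    + " skipping this table.");
            return new ArrayList<>();
        }
        return new ArrayList<>(rawEntries);
    }
}
```

The trade-off in this sketch is that the lenient variant silently loses the contents of the unsupported table, which is why it is only a workaround and not a substitute for parsing that kind of record.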

> Conversion DOC->TXT failed due to POI issue
> -------------------------------------------
>
>                 Key: TIKA-1836
>                 URL: https://issues.apache.org/jira/browse/TIKA-1836
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.11
>         Environment: Distributor ID:	Ubuntu
> Description:	Ubuntu 12.04.5 LTS
> Release:	12.04
> Codename:	precise
> java version "1.7.0_91"
> OpenJDK Runtime Environment (IcedTea 2.6.3) (7u91-2.6.3-0ubuntu0.12.04.1)
> OpenJDK 64-Bit Server VM (build 24.91-b01, mixed mode)
>            Reporter: Jorge Spinsanti
>         Attachments: test.doc
>
>
> When we try to convert DOC -> TXT, we get the following stack trace:
> {code}
> Caused by: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@1ddeedb6
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:282)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	... 15 more
> Caused by: java.lang.UnsupportedOperationException: Non-extended character Pascal strings are not supported right now. Please, contact POI developers for update.
> 	at org.apache.poi.hwpf.model.Sttb.fillFields(Sttb.java:82)
> 	at org.apache.poi.hwpf.model.Sttb.<init>(Sttb.java:61)
> 	at org.apache.poi.hwpf.model.SttbUtils.readSttbSavedBy(SttbUtils.java:52)
> 	at org.apache.poi.hwpf.model.SavedByTable.<init>(SavedByTable.java:53)
> 	at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:361)
> 	at org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:144)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:146)
> 	at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:117)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
> 	... 22 more
> {code}


