Posted to dev@tika.apache.org by "Nick Burch (JIRA)" <ji...@apache.org> on 2015/01/13 23:15:35 UTC

[jira] [Commented] (TIKA-1515) Old XLS 3 parsing is not working on some documents

    [ https://issues.apache.org/jira/browse/TIKA-1515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14276053#comment-14276053 ] 

Nick Burch commented on TIKA-1515:
----------------------------------

Hopefully fixed in Apache POI in r1651517. It seems BIFF 2 and 3 use codepage values above 32767, which wrap around to negative numbers when read as a signed short, and we weren't handling those.

Any chance you could grab a nightly / svn build and see if that behaves nicely? If so, we can try to roll a new POI beta fairly soon with the fix.
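
For anyone wondering how the -32767 in the stack trace comes about, here is a minimal sketch of the signed-short wraparound. It is only an illustration, not the actual change in r1651517; the class name CodepageWraparound and the helper toUnsignedCodepage are made up for this example, and the two input bytes are an assumption inferred from the -32767 value in the exception message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical illustration class, not part of POI or Tika.
class CodepageWraparound {

    // Reinterpret a signed 16-bit value as an unsigned one (0..65535).
    static int toUnsignedCodepage(short raw) {
        return raw & 0xFFFF;
    }

    public static void main(String[] args) {
        // Assumed codepage field: bytes 0x01 0x80, little-endian, i.e. 0x8001.
        byte[] data = {0x01, (byte) 0x80};
        short raw = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN).getShort();

        System.out.println(raw);                     // prints -32767
        System.out.println(toUnsignedCodepage(raw)); // prints 32769
    }
}
```

Read as a signed short, 0x8001 comes out as -32767, which matches the "Codepage number may not be -32767" message; masking with 0xFFFF recovers 32769, which could then be mapped to an encoding.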

> Old XLS 3 parsing is not working on some documents
> --------------------------------------------------
>
>                 Key: TIKA-1515
>                 URL: https://issues.apache.org/jira/browse/TIKA-1515
>             Project: Tika
>          Issue Type: Bug
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: 081247.unk.xls
>
>
> Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and excel.sheet.3, and we have parsing for excel.sheet.4.  It looks like there are two issues with excel.sheet.3 parsing on most excel.sheet.3 files in govdocs1.
> The predominant issue (169 out of 175) appears to stem from a bad/missing code page parse:
> {noformat}
> Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
> 	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
> 	at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
> 	at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
> 	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
> 	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
> 	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
> 	... 41 more
> Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
> 	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
> 	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
> 	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
> 	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
> 	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
> 	... 46 more
> {noformat}
> The second issue only affects 4 documents.


