Posted to dev@tika.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2015/01/13 17:43:35 UTC

[jira] [Created] (TIKA-1515) Old XLS 3 parsing is not working

Tim Allison created TIKA-1515:
---------------------------------

             Summary: Old XLS 3 parsing is not working
                 Key: TIKA-1515
                 URL: https://issues.apache.org/jira/browse/TIKA-1515
             Project: Tika
          Issue Type: Bug
            Reporter: Tim Allison
            Priority: Minor


Thanks to [~gagravarr], we now have mime type id for excel.sheet.4 and excel.sheet.3, and we have parsing for excel.sheet.4. It looks like there are two issues with excel.sheet.3 parsing, which together affect most of the excel.sheet.3 files in govdocs1.

The predominant issue (169 out of 173) appears to stem from a bad/missing code page parse:
{noformat}
Caused by: java.lang.IllegalArgumentException: Unsupported codepage requested
	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:83)
	at org.apache.poi.hssf.record.OldLabelRecord.getValue(OldLabelRecord.java:82)
	at org.apache.poi.hssf.extractor.OldExcelExtractor.getText(OldExcelExtractor.java:159)
	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:82)
	at org.apache.tika.parser.microsoft.OldExcelParser.parse(OldExcelParser.java:76)
	at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:256)
	... 41 more
Caused by: java.io.UnsupportedEncodingException: Codepage number may not be -32767
	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:275)
	at org.apache.poi.util.CodePageUtil.codepageToEncoding(CodePageUtil.java:253)
	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:231)
	at org.apache.poi.util.CodePageUtil.getStringFromCodePage(CodePageUtil.java:219)
	at org.apache.poi.hssf.record.OldStringRecord.getString(OldStringRecord.java:81)
	... 46 more
{noformat}
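
For what it's worth, the value -32767 in the trace looks like a 16-bit codepage field that has been read as a signed short. The sketch below is only an illustration of that arithmetic and does not use POI; the class and method names (CodepageSketch, toUnsigned16, guessEncoding) are made up for this example, and the mapping of 32768/32769 to Mac Roman/Windows-1252 is my assumption about how old BIFF2/BIFF3 files use those values, not something confirmed by this stack trace.

```java
// Illustration only: hypothetical names, no POI dependency.
class CodepageSketch {

    // Reinterpret a signed 16-bit value as an unsigned one (0..65535).
    static int toUnsigned16(short raw) {
        return raw & 0xFFFF;
    }

    // ASSUMPTION: 32768 and 32769 are the old-Excel values for
    // Mac Roman and Windows-1252. Anything else is reported as unknown.
    static String guessEncoding(int codepage) {
        switch (codepage) {
            case 32768: return "MacRoman";
            case 32769: return "windows-1252";
            default:    return "unknown";
        }
    }

    public static void main(String[] args) {
        short raw = (short) -32767;           // value reported in the exception
        int codepage = toUnsigned16(raw);     // same bits, read as unsigned
        System.out.println(codepage);                       // prints 32769
        System.out.println(Integer.toHexString(codepage));  // prints 8001
        System.out.println(guessEncoding(codepage));
    }
}
```

If that reading is right, the fix would be to treat the codepage as unsigned and to recognise the value 32769 before the string records are decoded, but that needs checking against the actual files.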

The second issue only affects 4 documents.


