Posted to dev@tika.apache.org by "Hudson (JIRA)" <ji...@apache.org> on 2014/12/22 06:42:14 UTC

[jira] [Commented] (TIKA-1490) Basic parser for old Excel files (eg Excel 4)

    [ https://issues.apache.org/jira/browse/TIKA-1490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14255452#comment-14255452 ] 

Hudson commented on TIKA-1490:
------------------------------

SUCCESS: Integrated in tika-trunk-jdk1.7 #380 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/380/])
TIKA-1490 Unit tests for Excel 2-4 parser (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647242)
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/AbstractPOIContainerExtractionTest.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/OldExcelParserTest.java
TIKA-1490 Parser for old Excel 2-4 files (nick: http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1647240)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/OldExcelParser.java
* /tika/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser


> Basic parser for old Excel files (eg Excel 4)
> ---------------------------------------------
>
>                 Key: TIKA-1490
>                 URL: https://issues.apache.org/jira/browse/TIKA-1490
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Nick Burch
>            Assignee: Nick Burch
>
> In TIKA-1487, we added mime magic for the pre-OLE2 Excel file formats. Based on a reading of the OpenOffice Excel docs for those formats, it looks like it should be possible to produce a basic parser to extract key bits of info (e.g. strings) from these older file formats.
> This would likely be done largely by having a custom record iterator for the older formats, then passing the handful of "interesting" records to POI's record classes (maybe with some tweaks for the older formats) to parse the binary data, which the parser would then return.
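
The record-iterator idea in the description can be sketched roughly as below. This is a minimal illustration, not the code committed in OldExcelParser.java. It assumes the commonly documented layout in which every record starts with a 2-byte id and a 2-byte payload length, both little-endian. The names OldExcelRecordSketch, RawRecord and interestingRecords are hypothetical, and the caller supplies the set of "interesting" record ids.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative only: not the actual OldExcelParser code. */
final class OldExcelRecordSketch {

    /** One raw record: its 16-bit id and its undecoded payload. */
    static final class RawRecord {
        final int sid;
        final byte[] data;

        RawRecord(int sid, byte[] data) {
            this.sid = sid;
            this.data = data;
        }
    }

    /**
     * Walks the record stream and keeps only the records whose id is in
     * interestingSids. Assumed layout per record: id (2 bytes), payload
     * length (2 bytes), then the payload, with little-endian integers.
     */
    static List<RawRecord> interestingRecords(byte[] stream, Set<Integer> interestingSids) {
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        List<RawRecord> kept = new ArrayList<>();
        while (buf.remaining() >= 4) {
            int sid = buf.getShort() & 0xFFFF;    // read as unsigned 16-bit
            int length = buf.getShort() & 0xFFFF; // read as unsigned 16-bit
            if (length > buf.remaining()) {
                break; // truncated record: stop rather than guess
            }
            if (interestingSids.contains(sid)) {
                byte[] data = new byte[length];
                buf.get(data);
                kept.add(new RawRecord(sid, data));
            } else {
                buf.position(buf.position() + length); // skip the payload
            }
        }
        return kept;
    }
}
```

In a real parser the kept payloads would then be handed to whichever classes decode them (POI's record classes, in the approach described above). That decoding step is left out here because its details depend on the file version.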



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)